Soma-notes - User contributions [en]

COMP 3000 Essay 2 2010 Question 8

2010-12-02T16:21:33Z

Sliske: /* Additional questions */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the reader needs to understand the two concepts on which it builds: information flow and taint analysis.
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can occur between two processes or within a single process, for example between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information flow theory tries to capture this flow in a mathematical model. 
In a security model, information flow falls into two categories: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leak occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on which path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Because the program branches on the value of 'secure', anyone who can read the public variable afterwards can infer information about 'secure' without ever reading it directly. Leaks through implicit flows are much harder to detect and prevent than leaks through explicit flows, precisely because they are indirect.
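The difference between the two kinds of flow can be sketched in a few lines of Python. This is only an illustration: the <code>Labeled</code> class and the two copy functions are hypothetical helpers, not part of any real tracking system. A tracker that copies labels along with values catches the explicit flow but misses the implicit one.

```python
class Labeled:
    """A value paired with a security label (hypothetical helper)."""

    def __init__(self, value, private=False):
        self.value = value
        self.private = private


def explicit_copy(secure):
    # Explicit flow: the value itself is copied, so the label travels with it.
    return Labeled(secure.value, private=secure.private)


def implicit_copy(secure):
    # Implicit flow: only the branch condition depends on the secret.
    # The constants 1 and 0 carry no label, so the result looks public.
    if secure.value == "blah blah":
        return Labeled(1)
    return Labeled(0)


secure = Labeled("blah blah", private=True)
a = explicit_copy(secure)
b = implicit_copy(secure)

print(a.private)           # the explicit copy is still marked private
print(b.private, b.value)  # unmarked, yet the value reveals the comparison result
```

Here <code>b</code> is not marked private, but its value still tells an observer whether the secret equals "blah blah".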

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach is to label data originating from untrusted sources as tainted. The analysis keeps track of all tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]
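A minimal sketch of this bookkeeping, assuming a "shadow" table that records one taint bit per variable name, might look as follows. The <code>TaintMonitor</code> class and its method names are invented for illustration and do not correspond to any cited tool.

```python
class TaintMonitor:
    """Toy run-time monitor: one taint bit per variable name (hypothetical)."""

    def __init__(self):
        self.shadow = {}  # variable name -> True if tainted
        self.log = []     # (sink name, variable) pairs that were flagged

    def source(self, var):
        # Data arriving from an untrusted source is labelled tainted.
        self.shadow[var] = True

    def assign(self, dst, *srcs):
        # dst is computed from srcs: it is tainted if any operand is tainted.
        self.shadow[dst] = any(self.shadow.get(s, False) for s in srcs)

    def sink(self, sink_name, var):
        # A potentially dangerous use of tainted data is logged.
        if self.shadow.get(var, False):
            self.log.append((sink_name, var))


monitor = TaintMonitor()
monitor.source("request")                     # untrusted input
monitor.assign("query", "prefix", "request")  # query = prefix + request
monitor.assign("banner", "prefix")            # banner = prefix
monitor.sink("exec", "query")
monitor.sink("exec", "banner")
print(monitor.log)
```

Every assignment and every sink call costs an extra lookup, which is where the run-time slowdown mentioned above comes from.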

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because it has no access to run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 
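The over-approximation can be sketched as a fixed-point computation over the assignments of a program, ignoring which branch is actually taken. The representation below (a list of destination and operand names) and the function <code>static_taint</code> are assumptions made for this example only.

```python
def static_taint(statements, sources):
    """Return every variable that may be influenced by a taint source.

    statements: list of (dst, [operands]) pairs, one per assignment,
                collected from all branches of the program.
    sources:    names of variables that hold user input.
    """
    tainted = set(sources)
    changed = True
    while changed:  # repeat until no new variable becomes tainted
        changed = False
        for dst, operands in statements:
            if dst not in tainted and tainted.intersection(operands):
                tainted.add(dst)
                changed = True
    return tainted


program = [
    ("a", ["user_input"]),
    ("b", ["a", "c"]),  # may sit in a branch that never executes
    ("d", ["c"]),
    ("e", ["b"]),
]
result = static_taint(program, ["user_input"])
print(sorted(result))
```

Because every assignment is considered regardless of reachability, <code>b</code> and <code>e</code> are reported even if the branch containing them never runs, which is the loss of accuracy described above.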

===Mathematical Model===
<big><code>For all variables V = {T,U} ;T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 
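The operator ⊕ defined above can be written directly as a small function. The names <code>join</code> and <code>join_all</code> are chosen here for illustration.

```python
from functools import reduce

T, U = "T", "U"  # tainted and untainted labels


def join(x, y):
    """x ⊕ y: the result is tainted if either operand is tainted."""
    return T if T in (x, y) else U


def join_all(labels):
    """Label of a value computed from several operands."""
    return reduce(join, labels, U)


for x in (T, U):
    for y in (T, U):
        print(x, "⊕", y, "=", join(x, y))
```

Folding <code>join</code> over all operands of an expression gives the label of its result, which is how a single tainted input spreads to everything derived from it.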

''Note: The paper uses dynamic taint analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid makes ingenious use of the same mechanism in the opposite direction: it taints variables that hold valuable data and tracks their progress, so in this case the "tainted" variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the phone through the network interface, we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead figure was obtained with a CPU-bound benchmark; while accurate for the scope in which it was measured, the performance loss is not necessarily noticed by the end user.
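One compact way to hold 32 independent taint markings per data unit is a 32-bit bit vector, where each bit stands for one kind of sensitive data and combining two tags is a bitwise OR. The sketch below assumes that encoding; the marking names and bit positions are hypothetical and chosen only for this example.

```python
# Hypothetical marking names and bit positions, for illustration only.
TAINT_CLEAR = 0
TAINT_LOCATION = 1 << 0
TAINT_DEVICE_ID = 1 << 1
TAINT_PHONE_NUMBER = 1 << 2
MASK_32 = 0xFFFFFFFF  # at most 32 distinct markings fit in one tag

NAMES = {
    TAINT_LOCATION: "location",
    TAINT_DEVICE_ID: "device_id",
    TAINT_PHONE_NUMBER: "phone_number",
}


def combine(*tags):
    """Union of taint tags: a result inherits every marking of its operands."""
    result = TAINT_CLEAR
    for tag in tags:
        result |= tag
    return result & MASK_32


def markings(tag):
    """Names of the markings present in a tag."""
    return [name for bit, name in NAMES.items() if tag & bit]


tag = combine(TAINT_LOCATION, TAINT_CLEAR, TAINT_PHONE_NUMBER)
print(markings(tag))
```

With this encoding, propagating taint costs one OR per operation and one extra word of storage per tracked value, which is consistent with the goal of keeping overhead low.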

This low overhead is achieved by modifying the code directly at the virtual machine layer of the Android system (the Dalvik VM, which runs Android's Java applications) to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, they modified the Java Native Interface (JNI) to provide message-level tracking, which allows them to monitor inter-process communication. This also allows them to update variable taint tags when a method call returns, so they can keep track of information transferred via native code. Finally, by modifying the network and secondary storage interfaces, they are able to provide file-level taint tracking, which allows persistent content to keep its taint marks between sessions.
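To illustrate what message-level granularity means in practice, the following sketch assumes that a message carries a single tag equal to the union of the tags of the values packed into it, and that every value read out of the message inherits that tag. The functions <code>pack_message</code> and <code>unpack_message</code> are hypothetical and only model the idea.

```python
def pack_message(fields):
    """fields: dict of name -> (value, tag). Returns (payload, message_tag)."""
    payload = {name: value for name, (value, _tag) in fields.items()}
    message_tag = 0
    for _value, tag in fields.values():
        message_tag |= tag  # one tag for the whole message
    return payload, message_tag


def unpack_message(payload, message_tag):
    """Every value taken from the message inherits the message tag."""
    return {name: (value, message_tag) for name, value in payload.items()}


LOCATION = 1  # hypothetical marking bit
sent = {"lat": (45.4, LOCATION), "greeting": ("hi", 0)}
payload, message_tag = pack_message(sent)
received = unpack_message(payload, message_tag)
print(received["greeting"])
```

The receiver sees the harmless greeting as tainted too. Coarser granularity is cheaper to maintain than per-variable tags inside the message, at the price of possible false positives.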

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavyweight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama performs OS-aware, whole-system taint analysis to detect and analyze how malicious code processes information. It was one of the first dynamic taint analysis systems, and real-time operation is one of its core features. However, because Panorama uses instruction-level analysis, it has a high overhead. Most instruction-level taint analysis systems make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] TaintDroid avoids this problem by combining four levels of tracking. 
For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] It also determined that 50% of the applications submitted the user's location to advertising servers, and that 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, finer-grained tracking is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, pre-compiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could otherwise use to track implicit flow dynamically. It is presumed that branch structures are maintained at the kernel level, since a static analysis could uncover data leaks stemming from implicit data flow, whereas a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage, which arise because most string objects have the same tag. Because of these shared tags, false positives can occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes. That is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone. This implementation choice was reasonable for the research project that TaintDroid is, but taint analysis is (hopefully) of high importance to the everyday user; TaintDroid could have aimed to go further than research.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Compare this with the implementation of TaintCheck on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, and which then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications, and would allow TaintDroid to be used as a black box.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC),'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, each followed by the section that covers the question in more detail.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system at which to implement taint tracking. (Contribution)

===Additional questions===
* Provide a brief description of information flow and taint analysis.
** Information flow is the transfer of information between variables, methods, processes, and files. There are two types of information flow: implicit and explicit. Explicit flow is the direct transfer of data that results in it being more accessible than originally intended. Implicit flow refers to the ability to derive information that is supposed to be kept private. Taint analysis attempts to track information flow in order to better understand possible security issues. There are two types of taint analysis: static, which maps all possible paths of a program; and dynamic, which attempts to follow information as it's transferred in real time. Both can follow both implicit and explicit information flow, however there is a significant run-time disadvantage in tracking implicit flow in dynamic environments, so dynamic taint analysis is often done through emulation. (Background Concepts)
* How is TaintDroid different from previous taint analysis programs? What are some problems specific to the TaintDroid implementation?
** While dynamic analysis has been done before in many contexts, TaintDroid is one of the first attempts to do dynamic analysis on a live embedded system with resource constraints, and so has some unique concerns. The most specific is surely the fact that smartphones are resource constrained. Performing taint analysis without using emulation requires an efficient, low-overhead implementation, or the experiment will grind to a halt. The next largest issue is working with the existing software. TaintDroid needs to go low-level enough in the Android system to see everything the applications may possibly do, and also needs to interpret what the applications on the device are doing with the data, without being able to see the applications' source. Since applications are "black boxes", data may not look the same coming out as going in, and to get around this you must work at a level lower than the applications. (Research Problem)
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis, or an alternate dynamic analysis implementation [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [11<nowiki>]</nowiki>].

COMP 3000 Essay 2 2010 Question 8

2010-12-02T06:24:02Z

Sliske: /* Additional questions */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the ideas which form the basis of this theory have to be understood. All in all, the following two concepts can be said to be central to understanding this paper: 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
 

===Implicit Flow===
Implicit Flow is when information subject to security classifications is deduced indirectly. This the leakage of information occurs through the program control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on the flow of the program the secure information can be compromised, as shown below 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
insecure=1 
else: 
insecure=0</big> </code>
 
Since the application can determine the value of information in secure using logic statements, we can indirectly access the secure information. Information leak due to implicit flows is much harder to detect and protect from, due to the indirect nature of implicit flows.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint Analysis done at run-time is called as Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach offers the capabilities to detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being preformed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]

===Static Taint Analysis===
Static taint analysis is the technique used for detecting the over approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the sources of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage for static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand the analysis may not be as accurate as a dynamic analysis because the static analysis does not have access to any additional run-time information of the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 

===Mathematical Model===
<big><code>For all variables V = {T,U} ;T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T; x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 

''Note: The paper talks about Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" to taint variables that are of value and tracks their progress. Though in the actual context of Taint Analysis "taint" is used for untrusted information. However, in this case the "taint" variables are infact important private data. ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving through the network interface of the phone, then we know that some sensitive information has been disseminated.
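The source-to-sink idea can be sketched in a few lines. This is a toy model under stated assumptions, not TaintDroid's implementation: <code>Data</code>, <code>read_location</code>, <code>network_send</code>, the host name and the coordinates are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Data:
    """A string value plus the set of sensitive sources it derives from."""
    value: str
    labels: frozenset = frozenset()


def concat(a, b):
    """Derived data keeps the marks of everything it was built from."""
    return Data(a.value + b.value, a.labels | b.labels)


def read_location():
    """Hypothetical taint source: data is marked where it originates."""
    return Data("45.38,-75.69", frozenset({"location"}))


leak_log = []


def network_send(host, data):
    """Hypothetical taint sink: report marked data leaving the device."""
    if data.labels:
        leak_log.append((host, sorted(data.labels)))
    return len(data.value)  # stand-in for the number of bytes sent


request = concat(Data("ad_query?loc="), read_location())
network_send("ads.example.com", request)       # carries the location mark
network_send("ads.example.com", Data("ping"))  # unmarked, not reported
```

After both calls, <code>leak_log</code> holds a single entry for the request that carried the location mark.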

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to run in real time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid causes roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead figure was obtained with a CPU-bound benchmark; while accurate for that workload, the performance loss is not necessarily noticeable to the end user.

This low overhead is achieved by modifying the Dalvik virtual machine (VM), the layer of the Android system that runs applications, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, they added message-level tracking to monitor inter-process communication, and method-level tracking for native code called through the Java Native Interface (JNI): the taint tags of the affected variables are updated when a native method call returns, so information transferred via native code is still tracked. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking, which allows persistent content to keep its taint marks between sessions.
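One compact way to hold 32 independent taint markings per data unit is a 32-bit bit vector, with one bit per kind of sensitive source. The sketch below assumes that encoding; the constant names and bit positions are hypothetical and chosen only for illustration.

```python
# Hypothetical marking constants: one bit per kind of sensitive source.
TAINT_CLEAR = 0
TAINT_LOCATION = 1 << 0
TAINT_CONTACTS = 1 << 1
TAINT_IMEI = 1 << 2

MASK_32 = 0xFFFFFFFF  # keep tags within 32 bits

NAMES = {
    TAINT_LOCATION: "location",
    TAINT_CONTACTS: "contacts",
    TAINT_IMEI: "imei",
}


def merge(tag_a, tag_b):
    """Combining two values takes the union of their markings (bitwise OR)."""
    return (tag_a | tag_b) & MASK_32


def sources_in(tag):
    """List the names of the markings that are set in a tag."""
    return [name for bit, name in NAMES.items() if tag & bit]


tag = merge(TAINT_LOCATION, TAINT_IMEI)
```

With this encoding, merging two tags is a single OR instruction and a tag fits in one machine word, which is consistent with the low overhead reported above.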

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], TaintDroid uses the virtualized architecture of Android to propagate taint at four levels: variable, method, message, and file. The granularity and flow semantics chosen strongly influence both the performance and the accuracy of TaintDroid. Existing taint tracking approaches, such as the Panorama Taint System [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama performs OS-aware, whole-system taint analysis to detect and analyze how malicious code processes information. As Panorama was one of the first dynamic taint analysis programs, real-time operation was one of its core features. However, because Panorama works at the instruction level, it has a high overhead. Most instruction-level taint analysis systems run 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>] Moreover, instruction-level tracking faces a serious problem: taint explosion. Complex instructions such as CMPXCHG and REP MOV can cause the stack pointer to become falsely tainted, or can cause taint to be lost. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] TaintDroid addresses this problem by combining the four levels of tracking. 
For example, tracking at the variable level gives TaintDroid the flow semantics it needs for taint propagation, letting it distinguish data pointers from scalar values, which helps keep the tracking accurate.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were clearly legitimate uses, leaving 68 instances of potential misuse.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] It also determined that 50% of the applications sent the user's location to advertising servers, and that 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, finer-grained control over private data is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

The challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to get around them, treating the targeted sensitive information as a taint source and using a taint marking to identify the type of information. The research was clearly effective, given the impressive number of information leaks that were found. However, while the implementation is strong in that the overhead is low and the accuracy is high, trade-offs were made to reach that overhead.

To limit overhead, TaintDroid tracks only explicit data flows; it does not track implicit flows that occur through control flow. This is partly because the applications being tested are loaded onto the phone as black-box, pre-compiled binaries, but mostly because the Dalvik VM does not maintain the branch structures that TaintDroid would need to track implicit flow dynamically. The control structures live in the pre-compiled application code, so a static analysis could uncover data leaks that stem from implicit flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also issues with taint tag storage: to save space, a single tag is shared by a whole array or string object rather than stored per element. Because untainted data stored alongside tainted data then carries the same tag, false positives can occur.
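The false-positive effect of coarse tag storage can be illustrated with a toy container. This sketch assumes a design in which one tag is stored for a whole array; <code>TaggedArray</code> is a hypothetical class written for the example, not TaintDroid code.

```python
class TaggedArray:
    """Toy array that stores ONE taint tag for all of its elements."""

    def __init__(self, size):
        self.items = [0] * size
        self.tag = False  # shared by every element

    def store(self, index, value, tainted):
        self.items[index] = value
        # Once any element is tainted, the whole array counts as tainted.
        self.tag = self.tag or tainted

    def load(self, index):
        # Every read returns the shared tag, whatever was stored at `index`.
        return self.items[index], self.tag


arr = TaggedArray(2)
arr.store(0, 1234, tainted=True)   # sensitive value
arr.store(1, 7, tainted=False)     # harmless value
value, flagged = arr.load(1)       # harmless value is still reported tainted
```

Storing one tag per element would avoid this false positive, at the cost of extra memory for every array.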

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall CPU overhead of 14%) are on the high side for an already resource-constrained smartphone. This implementation choice was reasonable for a research project such as TaintDroid, but taint analysis is (hopefully) of high importance to the everyday user; TaintDroid could have aimed to go further than research.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead whenever the user uses the phone. Compare this with the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment, which allows it to monitor and control the program's execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites at run time and executes in the emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, and which returns a verdict on whether the application is safe. Such an analysis would need to run only once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications, and would let TaintDroid be used as a black box.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004)

[5] CEARA, D., POTET, M.-L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities: Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, along with the section that covers each question in more detail.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Provide a brief description of information flow and taint analysis.
** Information flow is the transfer of information between variables, methods, processes, and files. There are two types of information flow: explicit and implicit. Explicit flow is the direct transfer of data that results in it being more accessible than originally intended. Implicit flow refers to the ability to derive information that is supposed to be kept private. Taint analysis attempts to track information flow in order to better understand possible security issues. There are two types of taint analysis: static, which maps all possible paths of a program; and dynamic, which attempts to follow information as it is transferred in real time. In principle, either can follow both implicit and explicit information flow; however, tracking implicit flow dynamically carries a significant run-time cost, so dynamic taint analysis is often done through emulation. (Background Concepts)
* How is TaintDroid different from previous taint analysis programs? What are some problems specific to the TaintDroid implementation?
** While dynamic analysis has been done before in many contexts, TaintDroid is one of the first to attempt dynamic analysis on a live embedded system with resource constraints, and so has some unique concerns. The most specific is surely the fact that smartphones are resource constrained. Performing taint analysis without using emulation requires an efficient, low-overhead implementation, or the experiment will grind to a halt. The next largest issue is working with the existing software. TaintDroid needs to go low-level enough in the Android system to see everything the applications may possibly do, and also needs to interpret what the applications on the device are doing with the data, without being able to see the applications' source. Since applications are "black boxes", data may not look the same coming out as going in, and to get around this you must work at a level lower than the applications. (Research Problem)
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the Dalvik VM layer, where TaintDroid is implemented. Instead, control structures are part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [11<nowiki>]</nowiki>].

Talk:COMP 3000 Essay 2 2010 Question 8

2010-12-02T06:09:59Z

Sliske: /* Work Plan */

Group Members

Trevor Bonesaw Malone - tmalone@connect.carleton.ca //FIRST POST!

Qi Zhang - qzhang13@connect.carleton.ca

Gregory Bint - gbint@connect.carleton.ca

Gautam Akiwate - gakiwate@connect.carleton.ca

Corey Ling - cling@connect.carleton.ca

Sarah Liske

== Work Plan ==

As Trevor intimated, we should have clear division of work going forward. This is sort of the break down as I see it. Please edit as you think of new ideas!

* Background Concepts
** Information Flow Theory. (Implicit and Explicit Flows.) --Done[--[[User:Gautam|Gautam]] 03:54, 28 November 2010 (UTC)]
** What is dynamic taint analysis --Done[--[[User:Gautam|Gautam]] 05:07, 28 November 2010 (UTC)]
** What is the difference between dynamic and static analysis --Done[--[[User:Gautam|Gautam]] 03:54, 30 November 2010 (UTC)]]
* Research Problem
** How do we build a DTA engine for a phone? - done, but by who?
** Why do we want to? (information misuse) - done, but by who?
* Contribution
** How did they implement their DTA engine (Done: --[[User:Cling|Cling]] 04:50, 26 November 2010 (UTC))
** What did they find about information misuse (Done: --[[User:Cling|Cling]] 04:50, 26 November 2010 (UTC))
** Compared to the existing taint tracking approaches. [[User:Zhangqi|Zhangqi]] 07:11, 27 November 2010 (UTC)
** (What else should be in the contributions? Anything need fleshing out?) (Working on that now :) ) sliske
* Critique
**Added two paragraphs at the end of the present critique. Please incorporate it into your content as you deem fit.--[[User:Gautam|Gautam]] 09:07, 30 November 2010 (UTC)
**^ done. fleshed out critique, and added a bit about how taintdroid doesn't track implicit flow. Also reworded (the entire essay) for clarity where necessary/checked spelling. It would be a good idea for everyone to read it over once for spelling/clarity before thursday, just in case something doesn't make sense - sliske
* References
** The article has 61 references! We can probably use some of them
**whee! reading papers and sticking in information as need be.
**references added and citations -taken care of- were removed/reworked, as it says in the assignment guidelines they're not allowed. will go over fill in a few places where information may be lacking after class sliske
**Referencing is a little askew. The numbers don't match the papers as listed in the referencing. Also the papers are usually cited with a number and enclosed in "[]"
**thanks for giving the paper a read over/noticing that :)

List of information we need to find external sources for:
* History of taint analysis
* History of privacy research relating to smart phones

== Work In Progress ==

Log what you are working on *right now* so that other people don't try to do the same thing. Make sure to clear your name from here when you are done.

* Gregory Bint: Research Problem

* Gautam Akiwate: Background Concepts
** Any resources on Dynamic taint Analysis would be appreciated!

* Qi Zhang, Corey Ling: Contributions

* Trevor Malone: Critique

* Sarah Liske: References and Questions, Clarity/Spelling.

== Some Notes from the Video ==

Tracking of privacy sensitive data through Dynamic Taint Analysis (aka. Taint Tracking). The trick is to mark private data as it is sourced, and then follow those marks until (unless) they leave the phone.

Android phones run Java apps, which are compiled into DEX, and then run on top of the Dalvik VM. It is this VM that we modify so that we can support the storage and tracking of taint tags.

Taint sources
* Low-bandwidth sensors
** Location
** Accelerometer
* High-bandwidth sensors
** Mic
** Camera
* Information DB
** Address book
** SMS storage
* Device ID
** IMEI
** IMSI (don't actually track this one because of false positives)
** ICC_ID
** Phone Number

Taint sink (where marked data can leave the phone)
* Network Taint Sink

Taint propagation
* ???

Taint tags are stored in memory interleaved with the variables they are tracking

Some standard Data Flow technique is used to propagate these tags, especially as one variable that is marked may be assigned to another, so now that variable needs to be tracked as well.

Tracks explicit flows of data, not implicit
To fully capture implicit flows, you need to do static analysis, which is hard with closed-source apps, and cannot be done real-time

Implicit flows are not tracked
* Implicit flows can involve "taint-scope", tracking based on conditionals in code

=== Performance ===

The goal is to create a real-time tracking system, so TaintDroid's performance impact is of some importance

14% CPU overhead
4.4% memory overhead

Macro benchmarks (to get a feel for what the phone's usability is like with TD running)
* App load: 3% (2ms)

=== Findings ===

20 out of 30 tested applications share data in a way that is not expected.

67 of 105 flagged pieces of data leaving the device had no obviously legitimate purpose (verified by the authors).

Many apps sent location data and other unique identifiers to advertising servers.

Most apps do not mention anything to the user.

=== Limitations ===

Tracks only explicit data flows.

An application *could* launder the tags off of the data, if they really wanted to hide this sort of thing from TaintDroid.

There are methods that could be used to protect against this, but they go against the goal of a light-weight, real-time tracking system. TD is not necessarily about catching truly malicious programs, but rather just those that leak information.

Why do apps take this information?
* Lazy; in the demo video, the wallpaper app seems to use the IMEI just as a ready made unique ID
* Overzealous; the developer might think they *need* the data for something, but actually do not
* Ads; advertisers do seem a little presumptuous in their data collection
* Spying; bosses or spouses
* Malicious;

=== QA Period ===

Q: how do we prevent a malicious app from removing a taint attribute on a file

A: TD operates at too low a level for this to be a problem; TD assumes that the native code is trusted

Q: It seems like you had a lot of false positives

A: The point of this tool was to identify privacy sensitive information as having left the phone, not whether or not a privacy violation has taken place.

Q: Now that TD is released; couldn't malicious apps use some of the methods described in the paper to get around it?

A: Well, yes, but it is not just about maliciousness, it could just be laziness or over-zealous ad stuff.

==Other Information==

Hey guys, thought I would just post a generalized paragraph about our essay.

In today’s society, smartphones are the new big thing. To me that’s what makes this paper so interesting. This paper focuses on private information in Android phones and the misuse of this information. The information at risk includes the SIM card serial number, the ID of the device, and the phone number. TaintDroid runs on smartphones as an efficient taint tracking and analysis system. It has the ability to track sensitive data from multiple sources and to examine the misuse of such data. In their study of 30 popular third-party applications, TaintDroid found 68 instances of potential misuse of users' private data across 20 applications. This tool is great for knowing which applications are safe and which are not, so your private data can remain private.

Also, we should really think of splitting up the work in some way. If some people have specific sections they would like to do, let's figure that out now so we can divide the workload and get it done over the next couple of days. I don't personally care what part I'm going to have to do, so let's get this going. Any other information people wanna post, feel free; the more the better, even if we don't end up using it.

[[user:Tmalone|Trevor Malone]]

Hey guys! Anything else we need to get done? Let me know and I can help in anyway possible.

[[user:Tmalone|Trevor Malone]]

==Relevant Sources==
*NEWSOME, J., AND SONG, D. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection]
Seems to be THE Dynamic Taint Analysis paper. Talks about the implementation of TaintCheck. Could also be useful for the critique section -[Gautam]

COMP 3000 Essay 2 2010 Question 8

2010-12-02T06:09:04Z

Sliske: /* Contribution */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the reader needs to understand the ideas it builds on. The following two concepts are central: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information Flow Theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. This leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since the application can determine the value of information in secure using logic statements, we can indirectly access the secure information. Information leak due to implicit flows is much harder to detect and protect from, due to the indirect nature of implicit flows.
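To see why implicit flows escape a tracker that follows only explicit assignments, consider the sketch below. It assumes a hypothetical representation in which each value is a <code>(value, tainted)</code> pair and taint moves only when a tainted operand is copied or combined; the function name is illustrative.

```python
def copy_bit_via_branch(secret):
    """Reproduce a secret bit without ever assigning from it.

    `secret` is a (value, tainted) pair. Only constants are assigned below,
    so a tracker that propagates taint through assignments alone leaves
    the result untainted.
    """
    value, _tainted = secret
    if value == 1:            # the secret only steers control flow
        copy = (1, False)     # constant: no tainted operand involved
    else:
        copy = (0, False)
    return copy


leaked = [copy_bit_via_branch((bit, True)) for bit in (0, 1)]
```

Each returned pair has the same value as the secret bit but carries no taint mark, so it could pass a taint sink unnoticed.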

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the program's source code. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, it may be less accurate than a dynamic analysis, because a static analysis has no access to run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of tags, where T means tainted and U means untainted. 
Define <big>⊕</big>: V x V -> V, for x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted value is used to compute another variable, that variable becomes tainted as well; the taint is propagated. Taking this further, we can mark variables as tainted by attaching a taint tag to them, which can then be tracked or used as needed. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 

''Note: The paper is about dynamic taint analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid makes ingenious use of the idea in the opposite direction: it taints variables that hold important private data and tracks their progress.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving through the network interface of the phone, then we know that some sensitive information has been disseminated.
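
The mark-at-source, check-at-sink idea can be illustrated with a small Python sketch. It is not TaintDroid's implementation (which works inside the virtual machine); <code>Tracked</code>, <code>read_location</code>, <code>network_send</code> and the coordinates are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A string paired with a taint flag (hypothetical helper type)."""
    value: str
    tainted: bool = False

    def __add__(self, other: "Tracked") -> "Tracked":
        # Concatenation propagates taint from either operand.
        return Tracked(self.value + other.value, self.tainted or other.tainted)


leak_log = []


def read_location() -> Tracked:
    """Taint source: sensitive data is marked where it enters the program."""
    return Tracked("45.38,-75.69", tainted=True)  # made-up coordinates


def network_send(payload: Tracked) -> None:
    """Taint sink: record a leak if marked data is about to leave the device."""
    if payload.tainted:
        leak_log.append(payload.value)


request = Tracked("GET /ads?loc=") + read_location()
network_send(request)                # logged: derived from the location
network_send(Tracked("GET /index"))  # not logged: no sensitive input
```

Note that the request is flagged even though the location was combined with other data before being sent, which is the behaviour the tracking system needs.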

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs only about 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead figure comes from a CPU-bound benchmark; while accurate for that scope, the performance loss is not necessarily noticed by the end user.
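
One compact way to hold 32 taint markings per data unit is a 32-bit vector with one bit per marking, so that merging two tags is a single bitwise OR. The sketch below assumes that encoding; the marking names and bit positions are hypothetical and are not taken from the paper.

```python
from enum import IntFlag


class Tag(IntFlag):
    """Hypothetical marking names; each occupies one bit of a 32-bit tag."""
    LOCATION = 1 << 0
    CONTACTS = 1 << 1
    PHONE_NUMBER = 1 << 2
    DEVICE_ID = 1 << 3


TAG_MASK = 0xFFFFFFFF  # 32 bits, so at most 32 distinct markings


def merge(a: int, b: int) -> int:
    """Tag of a value derived from two inputs: the union of their markings."""
    return (int(a) | int(b)) & TAG_MASK


def has_marking(tag: int, marking: Tag) -> bool:
    """True if the given marking's bit is set in the tag."""
    return (tag & int(marking)) != 0


tag = merge(Tag.LOCATION, Tag.DEVICE_ID)
```

A fixed-width tag keeps the per-variable storage cost constant no matter how many kinds of sensitive data have been mixed into a value.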

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, they modified the Java Native Interface (JNI) to provide message-level tracking, which allows them to monitor inter-process communications. This also allows them to update variable taint tags when a method call returns, so they can keep track of information transferred via native code. Finally, by modifying the network and secondary storage interfaces, they are able to provide file-level taint tracking, which allows persistent content to keep its taint marks between sessions.

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], TaintDroid uses the virtualized architecture of Android to propagate taint at four levels: variable, method, message, and file. The granularity and flow semantics chosen at each level strongly influence both its performance and its accuracy. Existing taint tracking approaches, such as the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama performs OS-aware, whole-system taint analysis to detect and analyze the information processing behavior of malicious code. As one of the first dynamic taint analysis programs, Panorama counts real-time analysis among its core features. However, because Panorama works at the instruction level, it has a high overhead. Most taint analysis systems that use instruction-level methods make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] TaintDroid avoids this problem by combining the four levels of tracking. 
For example, variable-level tracking lets TaintDroid provide flow semantics for taint propagation, which allows it to distinguish between different data pointers at different levels and so ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were clearly legitimate uses of the data.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] It also determined that 50% of the applications sent the user's location to advertising servers and that 7 of the applications transmitted the device ID and, in some cases, the phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to work around the challenges, treating the targeted sensitive information as a taint source and using a taint marking to identify the information type. It is easy to see that this research was effective, given the impressive number of information leaks that were found; TaintDroid identifies a high proportion of information misuse. However, while the implementation is strong in that overhead is low and accuracy is high, trade-offs were made to achieve that overhead.

To avoid additional overhead, TaintDroid does not track implicit data flow, that is, flow through the program's control flow. This is partly because the applications being tested are loaded onto the phone as black-box, pre-compiled binaries, but mostly because the Android JVM does not maintain branch structures that TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at the kernel level, since a static analysis could uncover data leaks stemming from implicit flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: a string object carries a single tag for all of its contents, and because the tag is this coarse, false positives can occur.
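
The bypass can be made concrete with a short Python sketch. The tracker here is a hypothetical stand-in that propagates taint only on direct copies, as an explicit-flow tracker would; the class, function names and the digit string are invented for the example.

```python
class Tracked:
    """A character paired with a taint flag (hypothetical helper type)."""

    def __init__(self, value: str, tainted: bool = False):
        self.value = value
        self.tainted = tainted


def copy_explicit(secret: list) -> list:
    # Direct assignment: each output element inherits the input's taint.
    return [Tracked(c.value, c.tainted) for c in secret]


def copy_implicit(secret: list) -> list:
    # The secret only steers the branch; the output is built from
    # untainted constants, so an explicit-flow tracker sees nothing.
    out = []
    for c in secret:
        for candidate in "0123456789":
            if c.value == candidate:
                out.append(Tracked(candidate))  # constant, untainted
    return out


secret = [Tracked(d, tainted=True) for d in "6135550100"]  # made-up number
explicit = copy_explicit(secret)
implicit = copy_implicit(secret)
```

Both copies contain the same digits, but only the explicit copy is still marked as tainted; the implicit copy would pass through a network sink without being flagged.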

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely: the overheads (4.4% memory, 27% IPC, and 14% overall) are on the high side for an already resource-constrained smartphone. This implementation choice was reasonable for a research project, but taint analysis is (hopefully) of high importance to the everyday user; TaintDroid could have aimed to go further than research.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider instead the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its emulated environment. In other words, 'TaintCheck' performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary and which then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads are much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications, and it would allow TaintDroid to be used as a black box.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, along with the section in which to find more information about each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system at which to implement taint tracking. (Contribution)

===Additional questions===
* Provide a brief description of information flow and taint analysis.
** Information flow is the transfer of information between variables, methods, processes, and files. There are two types of information flow: implicit and explicit. Explicit flow is the direct transfer of data that results in it being more accessible than originally intended. Implicit flow refers to the ability to derive information that is supposed to be kept private. Taint analysis attempts to track information flow in order to better understand possible security issues. There are two types of taint analysis: static, which maps all possible paths of a program; and dynamic, which attempts to follow information as it is transferred in real time. Each can follow both implicit and explicit information flow; however, there is a significant run-time disadvantage in tracking implicit flow in dynamic environments, so dynamic taint analysis is often done through emulation. (Background Concepts)
* How is TaintDroid different from previous taint analysis programs? How does it achieve these goals?
** While dynamic analysis has been done before in many contexts, TaintDroid is one of the first to attempt to do dynamic analysis on a live embedded system with resource constraints, and so a lot of effort is put into reducing overhead. (Contribution)
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [11<nowiki>]</nowiki>].

COMP 3000 Essay 2 2010 Question 8

2010-12-02T06:01:53Z

Sliske: /* Additional questions */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the reader must first understand the ideas on which it is based. The following two concepts are central: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can be between two processes or within a given process, for example between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information flow theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on the flow of the program, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure == "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Because the value written to the PUBLIC variable depends on a test of 'secure', anyone who can read that variable learns something about the secure information without it ever being copied directly. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach is to label data originating from untrusted sources as tainted. The analysis keeps track of all tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the program's source code. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis because it has no access to run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint states, where T is tainted and U is untainted. 
Define <big>⊕</big>: V x V -> V for x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted value is used to compute another variable, that variable becomes tainted as well; the taint is propagated. Taking this further, we can mark a variable as tainted by attaching a taint tag to it, which can then be tracked or used as needed. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 

''Note: The paper is about dynamic taint analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid makes ingenious use of the idea in the opposite direction: it taints variables that hold important private data and tracks their progress.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving through the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs only about 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, they modified the Java Native Interface (JNI) to provide message-level tracking, which allows them to monitor inter-process (that is, inter-application) communications. This also allows them to "patch the taint propagation on return" (Enck et al., YEAR, p3), so they can keep track of information transferred via native code. Finally, by modifying the network and secondary storage interfaces, they are able to provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], TaintDroid uses the virtualized architecture of Android to propagate taint at four levels: variable, method, message, and file. The granularity and flow semantics chosen at each level strongly influence both its performance and its accuracy. Existing taint tracking approaches, such as the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama performs OS-aware, whole-system taint analysis to detect and analyze the information processing behavior of malicious code. As one of the first dynamic taint analysis programs, Panorama counts real-time analysis among its core features. However, because Panorama works at the instruction level, it has a high overhead. Most taint analysis systems that use instruction-level methods make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] TaintDroid avoids this problem by combining the four levels of tracking. 
For example, variable-level tracking lets TaintDroid provide flow semantics for taint propagation, which allows it to distinguish between different data pointers at different levels and so ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were determined to be clearly legitimate uses, leaving 68 instances of potential misuse.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] It also determined that 50% of the applications sent the user's location to advertising servers, and that seven applications collected the device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

The challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, given the impressive number of information leaks that were found. However, while the implementation is strong in that the overhead is low and the accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, pre-compiled binaries, but mostly because Dalvik bytecode does not maintain branch structures that TaintDroid could use to track implicit flow dynamically. A static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: an array, and therefore a string object, carries a single taint tag for all of its elements. Because the tag is shared, it is possible for false positives to occur.
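
The bypass described above can be sketched in Python. The <code>TaintedStr</code> class and the <code>launder</code> function are hypothetical, and the tracker modelled here propagates taint through concatenation only, which is an assumption made for illustration rather than a description of TaintDroid's internals.

```python
class TaintedStr:
    """A string with a taint flag that propagates through concatenation only."""

    def __init__(self, text, tainted=False):
        self.text = text
        self.tainted = tainted

    def __add__(self, other):
        return TaintedStr(self.text + other.text, self.tainted or other.tainted)


def launder(secret, alphabet="0123456789"):
    """Rebuild `secret` from clean constants, one comparison at a time."""
    out = TaintedStr("")
    for ch in secret.text:
        for candidate in alphabet:
            if ch == candidate:                    # control flow depends on the secret
                out = out + TaintedStr(candidate)  # but only clean data is copied
    return out


imei = TaintedStr("356938035643809", tainted=True)
copy = launder(imei)  # same digits as imei, yet the taint flag is not set
```

Because no tainted value is ever assigned directly, a tracker that follows explicit flow alone reports nothing when <code>copy</code> is later sent out.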

Furthermore, TaintDroid is a firmware modification, not an application, which raises questions about its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone. This implementation choice was reasonable for the research project that TaintDroid is, but taint analysis is (hopefully) of high importance to the everyday user; TaintDroid could have aimed to go further than research.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead as the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck is a mechanism that performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which the user uploads the application binary, and TaintDroid then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications. This would allow TaintDroid to be used as a black box.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, along with the section to consult for more information on each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine used by Android to run user applications written in Java. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Provide a brief description of information flow and taint analysis.
** Information flow is the transfer of information between variables, methods, processes, and files. There are two types of information flow: implicit and explicit. Explicit flow is the direct transfer of data that results in it being more accessible than originally intended. Implicit flow refers to the ability to derive information that is supposed to be kept private. Taint analysis attempts to track information flow in order to better understand possible security issues. There are two types of taint analysis: static, which maps all possible paths of a program; and dynamic, which attempts to follow information as it is transferred in real time. Both can follow implicit and explicit information flow; however, tracking implicit flow dynamically carries a significant run-time cost, so dynamic taint analysis is often done through emulation. (Background Concepts)
* How is TaintDroid different from previous taint analysis programs? How does it achieve these goals?
** While dynamic analysis has been done before in many contexts, TaintDroid is one of the first to attempt to do dynamic analysis on a live embedded system with resource constraints, and so a lot of effort is put into reducing overhead. (Contribution)
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the Dalvik VM layer, where TaintDroid is implemented. Instead, control structures are part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [11<nowiki>]</nowiki>].

COMP 3000 Essay 2 2010 Question 8

2010-12-02T06:01:32Z

Sliske: /* Contribution */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form its basis have to be understood. The following two concepts are central to understanding it: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information flow theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
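
A minimal Python sketch of the explicit flow above. The <code>Labeled</code> class, the <code>assign</code> helper and the numeric <code>PUBLIC</code>/<code>PRIVATE</code> levels are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

PUBLIC, PRIVATE = 0, 1  # hypothetical security levels, higher is more restricted


@dataclass
class Labeled:
    """A value paired with the security level it was classified under."""
    value: str
    level: int


def assign(source: Labeled, target_level: int) -> Labeled:
    """Copy `source` into a variable declared at `target_level`.

    Raises if the copy would make the data more visible than intended,
    which is the explicit flow described above.
    """
    if source.level > target_level:
        raise PermissionError("explicit flow: PRIVATE data copied to PUBLIC variable")
    return Labeled(source.value, target_level)


secure = Labeled("555-0100", PRIVATE)
try:
    notsecure = assign(secure, PUBLIC)  # notsecure = secure
    leaked = True
except PermissionError:
    leaked = False
```

Copying in the other direction, from PUBLIC to PRIVATE, is allowed because it does not make the data more visible.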
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure == "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Since the application can determine the value of information in secure using logic statements, we can indirectly access the secure information. Information leak due to implicit flows is much harder to detect and protect from, due to the indirect nature of implicit flows.
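
To make the difficulty concrete, here is a small Python sketch of a tracker that follows direct assignments only. The <code>Tracked</code> class and both functions are hypothetical; the point is that the branch reveals the secret while no tainted value is ever copied.

```python
class Tracked:
    """An integer carrying a taint flag that follows direct assignments only."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def copy(self):
        # Explicit flow: the taint flag travels with the data.
        return Tracked(self.value, self.tainted)


def explicit_leak(secure):
    return secure.copy()  # notsecure = secure


def implicit_leak(secure, guess):
    # The branch depends on `secure`, but the assigned constants are clean,
    # so a tracker that ignores control flow sees nothing tainted here.
    if secure.value == guess:
        return Tracked(1)
    return Tracked(0)


secure = Tracked(42, tainted=True)
direct = explicit_leak(secure)        # flagged: direct.tainted is set
indirect = implicit_leak(secure, 42)  # not flagged, yet it confirms the guess
```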

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]

===Static Taint Analysis===
Static taint analysis is a technique for computing an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because the static analysis does not have access to any additional run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 
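
A rough Python sketch of the static approach, under strong simplifying assumptions: the program is written in a tiny hypothetical intermediate form with only assignments and <code>if</code> blocks, and the analysis is flow-insensitive. The function <code>tainted_vars</code> and the tuple format are invented for this example.

```python
def tainted_vars(program, sources):
    """Return an over-approximation of the variables influenced by `sources`.

    `program` is a list of statements in a tiny hypothetical IR:
      ("assign", target, [operand, ...])
      ("if", [condition_var, ...], [statement, ...])
    Every branch body is analysed whether or not it would run, and targets
    assigned under a tainted condition are tainted too (implicit flow).
    """
    tainted = set(sources)

    def visit(statements, control_tainted):
        changed = False
        for stmt in statements:
            if stmt[0] == "assign":
                _, target, operands = stmt
                if (control_tainted or tainted & set(operands)) and target not in tainted:
                    tainted.add(target)
                    changed = True
            else:  # "if"
                _, cond_vars, body = stmt
                inner = control_tainted or bool(tainted & set(cond_vars))
                changed |= visit(body, inner)
        return changed

    while visit(program, False):  # iterate until no new variable is added
        pass
    return tainted


program = [
    ("assign", "a", ["secure"]),           # a = secure   (explicit)
    ("if", ["a"], [("assign", "b", [])]),  # if a: b = 1  (implicit)
    ("assign", "c", ["public"]),           # c = public   (clean)
]
result = tainted_vars(program, {"secure"})
```

Because every branch body is visited regardless of run-time values, the result may include variables that no real execution would taint, which is the over-approximation mentioned above.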

===Mathematical Model===
<big><code>Let V = {T, U}, where T means tainted and U means untainted. 
Define <big>⊕</big>: V x V -> V so that, for x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 
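
The operator above can be written out directly in Python. This is only a sketch of the two-valued model; the names <code>combine</code> and <code>taint_of</code> are chosen for this example.

```python
from functools import reduce
from itertools import product

T, U = "T", "U"  # tainted / untainted, as in the model above


def combine(x, y):
    """The ⊕ operator: the result is tainted if either operand is tainted."""
    return T if T in (x, y) else U


def taint_of(*operands):
    """Taint of a value computed from any number of operands."""
    return reduce(combine, operands, U)


# Full truth table of ⊕ over V x V.
table = {(x, y): combine(x, y) for x, y in product((T, U), repeat=2)}
```

Since U acts as the identity element, a value computed from no tainted operands stays untainted, and a single tainted operand is enough to taint the result.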

''Note: The paper talks about Dynamic Taint Analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid instead makes ingenious use of "taint" to mark variables that hold valuable private data and tracks their progress.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the network interface of the phone, then we know that some sensitive information has been disseminated.
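
The source-to-sink idea can be sketched in a few lines of Python. Everything here is hypothetical: <code>read_location</code> stands in for a sensor read, <code>network_send</code> stands in for the network interface, and the <code>Tainted</code> string subclass is just one simple way to carry a mark.

```python
class Tainted(str):
    """A string subclass used here as a marker for sensitive data."""


def read_location():
    # Taint source: hypothetical stand-in for a GPS read.
    return Tainted("45.38,-75.69")


def build_request(payload):
    # Data derived from a tainted value stays tainted.
    body = "loc=" + payload
    return Tainted(body) if isinstance(payload, Tainted) else body


leaks = []


def network_send(data):
    # Taint sink: log whenever marked data reaches the network interface.
    if isinstance(data, Tainted):
        leaks.append(str(data))
    return len(data)


network_send(build_request(read_location()))  # logged as a leak
network_send(build_request("none"))           # clean, not logged
```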

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes only roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., 2010, p1)
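
One compact way to hold 32 taint markings per data unit is a bit vector in which each marking owns one bit and tags are merged with bitwise OR. The Python sketch below assumes that representation; the marking names and the <code>merge</code> and <code>has</code> helpers are illustrative and are not taken from TaintDroid's source.

```python
from enum import IntFlag


class Taint(IntFlag):
    """Illustrative taint markings; each one occupies a single bit."""
    CLEAR = 0
    LOCATION = 1 << 0
    PHONE_NUMBER = 1 << 1
    DEVICE_ID = 1 << 2
    SIM_SERIAL = 1 << 3


MAX_MARKINGS = 32  # the tag is assumed to fit in one 32-bit word


def merge(*tags):
    """Combine tags with bitwise OR, so every marking is preserved."""
    result = Taint.CLEAR
    for tag in tags:
        result |= tag
    return result


def has(tag, marking):
    """Report whether `tag` carries the given marking."""
    return bool(tag & marking)


tag = merge(Taint.LOCATION, Taint.DEVICE_ID)
```

Merging two tags costs a single OR operation, which is one reason a fixed-width tag keeps the per-operation cost small.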

This low overhead is achieved by instrumenting the interpreter of Android's Dalvik virtual machine (VM), the Java-like VM that runs application code, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, message-level tracking allows them to monitor inter-process, i.e. inter-application, communications. For native code reached through the Java Native Interface (JNI), they use method-level tracking: the native code runs without instrumentation and they "patch the taint propagation on return" (Enck et al., 2010, p3), so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., 2010, p3).
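
The conservative file-level behaviour quoted above can be sketched as follows. The sketch assumes one integer tag per file that only ever accumulates markings; the <code>TaggedFileStore</code> class is a toy in-memory stand-in invented for this example, not a model of Android's storage layer.

```python
class TaggedFileStore:
    """Toy file store keeping a single taint tag per file."""

    def __init__(self):
        self.contents = {}
        self.tags = {}

    def write(self, name, data, tag):
        # Appending tainted data taints the whole file; tags are never cleared.
        self.contents[name] = self.contents.get(name, "") + data
        self.tags[name] = self.tags.get(name, 0) | tag

    def read(self, name):
        # Every read returns the file-wide tag, even for data written clean.
        return self.contents[name], self.tags[name]


LOCATION = 0b01  # hypothetical taint marking
store = TaggedFileStore()
store.write("cache.txt", "hello ", 0)
store.write("cache.txt", "45.38,-75.69", LOCATION)
data, tag = store.read("cache.txt")
```

The coarse granularity keeps bookkeeping cheap, at the cost of possible false positives when clean data is read back from a file that also holds tainted data.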

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like Panorama [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama is able to perform OS-aware, whole-system taint analysis to detect and analyze how malicious code processes information. However, because Panorama works at the instruction level, it has a high overhead. Most taint analysis systems that use instruction-level methods make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] TaintDroid avoids this problem by combining the four levels of tracking. 
For example, variable-level tracking lets TaintDroid provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were determined to be clearly legitimate uses, leaving 68 instances of potential misuse.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] It also determined that 50% of the applications sent the user's location to advertising servers, and that seven applications collected the device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

The challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, given the impressive number of information leaks that were found. However, while the implementation is strong in that the overhead is low and the accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, pre-compiled binaries, but mostly because Dalvik bytecode does not maintain branch structures that TaintDroid could use to track implicit flow dynamically. A static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: an array, and therefore a string object, carries a single taint tag for all of its elements. Because the tag is shared, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises questions about its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone. This implementation choice was reasonable for the research project that TaintDroid is, but taint analysis is (hopefully) of high importance to the everyday user; TaintDroid could have aimed to go further than research.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead as the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck is a mechanism that performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which the user uploads the application binary, and TaintDroid then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications. This would allow TaintDroid to be used as a black box.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, along with the section where more information on each question can be found.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine Android uses to run user applications written in Java. Although Dalvik is a core part of the Android operating system, it is not part of the kernel: it runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point in the system at which to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the VM layer where TaintDroid is implemented. Instead, control structures are part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [11<nowiki>]</nowiki>].

COMP 3000 Essay 2 2010 Question 8

2010-12-02T06:00:43Z

Sliske: /* Contribution */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the reader needs to understand the ideas it builds on. Two concepts in particular are central: 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is copied into 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leak occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on which path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure = "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Because the branch that is taken depends on the value of 'secure', anyone who can read 'notsecure' learns something about 'secure', even though it was never copied directly. Information leaks due to implicit flows are much harder to detect and protect against, precisely because of this indirect nature.
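
To make this concrete, here is a minimal Python sketch. It is not taken from the paper; the function name <code>leak_via_branches</code> and the 8-bit width are assumptions chosen for illustration. A public value is rebuilt from a secret using only branches and constant assignments, so no statement ever copies the secret directly.

```python
def leak_via_branches(secret: int, bits: int = 8) -> int:
    """Rebuild `secret` in a public variable without ever assigning from it."""
    public = 0
    for i in range(bits):
        # The branch condition depends on the secret ...
        if (secret >> i) & 1:
            # ... but the assignment only uses constants and public data.
            public |= 1 << i
    return public


if __name__ == "__main__":
    recovered = leak_via_branches(0b10100111)
    print(recovered == 0b10100111)
```

A tracker that only follows assignments (explicit flow) would see nothing tainted being written to <code>public</code>, which is why this kind of leak is hard to catch.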

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint analysis performed at run time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically, by analyzing the program's source code rather than running it. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because it has no access to run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 
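
The over-approximation idea can be sketched in a few lines of Python. This is an illustrative toy, not the analysis used in any of the cited tools: the program representation (a list of <code>(target, operands)</code> assignments) and the name <code>static_taint</code> are assumptions. Every assignment is considered regardless of which execution path would reach it, and taint is propagated until nothing changes.

```python
def static_taint(assignments, sources):
    """Return every variable that may be influenced by a tainted source.

    assignments: list of (target, [operand, ...]) pairs, one per assignment
                 found anywhere in the program (all branches included).
    sources:     names of variables that hold user-controlled input.
    """
    tainted = set(sources)
    changed = True
    while changed:  # iterate to a fixpoint
        changed = False
        for target, operands in assignments:
            if target not in tainted and any(op in tainted for op in operands):
                tainted.add(target)
                changed = True
    return tainted


if __name__ == "__main__":
    program = [("c", ["b", "k"]), ("b", ["a"]), ("d", ["k"])]
    print(sorted(static_taint(program, {"a"})))
```

Because the program is never run, a variable assigned only on a path that can never execute is still reported, which is the source of the lower accuracy mentioned above.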

===Mathematical Model===
<big><code>Let V = {T,U} be the set of taint labels, where T is tainted and U is untainted: 
Using <big>⊕</big>: V x V -> V, with x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 
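
The operator above translates directly into code. The following Python sketch is only an illustration of the model; the names <code>Taint</code> and <code>combine</code> are assumptions introduced here.

```python
from enum import Enum


class Taint(Enum):
    U = 0  # untainted
    T = 1  # tainted


def combine(x: Taint, y: Taint) -> Taint:
    """The ⊕ operator: the result is tainted if either operand is tainted."""
    return Taint.T if Taint.T in (x, y) else Taint.U


if __name__ == "__main__":
    for x in Taint:
        for y in Taint:
            print(x.name, "⊕", y.name, "=", combine(x, y).name)
```

Whenever a value is computed from two operands, its label is the <code>combine</code> of their labels, which is how taint spreads from variable to variable.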

''Note: The paper is about dynamic taint analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid turns this around: it taints the variables that hold valuable private data and tracks where they go, so in this case the tainted variables are in fact the important private data. ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the phone through the network interface, then we know that some sensitive information has been disseminated.
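
The source, propagation and sink steps can be sketched as follows. This is a toy model in Python, not TaintDroid's implementation: the class <code>Tracked</code>, the functions <code>read_location</code> and <code>network_send</code>, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A string value that carries a taint mark along with it."""
    value: str
    tainted: bool = False

    def __add__(self, other):
        # Propagation: the result is tainted if either operand is tainted.
        if not isinstance(other, Tracked):
            other = Tracked(str(other))
        return Tracked(self.value + other.value, self.tainted or other.tainted)


def read_location() -> Tracked:
    """Hypothetical taint source: sensitive data is marked where it originates."""
    return Tracked("45.38,-75.69", tainted=True)


def network_send(data: Tracked, log: list) -> None:
    """Hypothetical taint sink: log any marked data leaving the device."""
    if data.tainted:
        log.append("tainted data sent: " + data.value)


if __name__ == "__main__":
    log = []
    message = Tracked("loc=") + read_location() + "&v=1"
    network_send(message, log)
    print(log)
```

The application never sends the raw location on its own, yet the combined message still carries the mark when it reaches the sink.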

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead was measured on a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)
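
One compact way to hold 32 distinct taint markings per data unit is a 32-bit bit vector, with one bit per kind of sensitive data. The Python sketch below assumes that encoding for illustration; the tag names and bit positions are hypothetical and are not taken from TaintDroid's source.

```python
# Hypothetical bit assignments: one bit per kind of sensitive data.
TAG_LOCATION = 1 << 0
TAG_PHONE_NUMBER = 1 << 1
TAG_DEVICE_ID = 1 << 2
MASK_32 = 0xFFFFFFFF  # keep tags within 32 bits

TAG_NAMES = {
    TAG_LOCATION: "location",
    TAG_PHONE_NUMBER: "phone_number",
    TAG_DEVICE_ID: "device_id",
}


def merge_tags(*tags: int) -> int:
    """Union of taint markings: a bitwise OR of all operand tags."""
    merged = 0
    for tag in tags:
        merged |= tag
    return merged & MASK_32


def describe(tag: int) -> list:
    """List the names of the markings present in a tag."""
    return [name for bit, name in TAG_NAMES.items() if tag & bit]


if __name__ == "__main__":
    combined = merge_tags(TAG_LOCATION, TAG_DEVICE_ID)
    print(describe(combined))
```

Merging two tags costs a single OR instruction, which helps explain how per-variable tracking can stay cheap.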

This low overhead is achieved by instrumenting the interpreter of Android's Dalvik virtual machine, the layer at which application bytecode runs, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, message-level tracking is used to monitor inter-process (that is, inter-application) communication. Native code reached through the Java Native Interface (JNI) is handled with method-level tracking: it runs uninstrumented, and the authors "patch the taint propagation on return" (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. Finally, by modifying the network and secondary storage interfaces they provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavyweight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], TaintDroid uses the virtualized architecture of Android to propagate taint at four levels: variable, method, message, and file. The granularity and flow semantics chosen at each level strongly influence both performance and accuracy. Existing taint tracking approaches, such as the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama performs OS-aware, whole-system taint analysis to detect and analyze how malicious code processes information. As one of the earlier dynamic taint analysis systems, Panorama analyzes this behavior while the system is running. However, because it works at the instruction level, it has a high overhead. Most taint analysis systems that use instruction-level methods make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>] Moreover, instruction-level tracking faces a serious problem known as taint explosion: with complex instructions such as CMPXCHG and REP MOV, taint can be lost, or registers such as the stack pointer can become falsely tainted. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] TaintDroid avoids this problem by combining the four levels of tracking. 
For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were judged to be clearly legitimate uses.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] It also determined that 50% of the applications sent the user's location to advertising servers, and that seven of the applications transmitted the device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer-grained control is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

The challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to get around them, treating the targeted sensitive information as a taint source and using a taint marking to identify the type of information. The research was clearly effective, given the impressive number of information leaks that were found. However, while the implementation is strong in that the overhead is low and the accuracy is high, trade-offs were made to achieve that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, pre-compiled binaries, but mostly because the bytecode executed by Android's Dalvik VM does not maintain branch structures that TaintDroid could use to track implicit flow dynamically. A static analysis could uncover data leaks stemming from implicit flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: to save memory, a single tag is shared by a whole array, and therefore by a whole string object. Because of this coarse granularity, it is possible for false positives to occur.
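
The false positives caused by coarse tag storage can be illustrated with a small Python sketch. It assumes, for illustration only, that one taint tag is kept per array rather than per element; the class <code>TaggedArray</code> is hypothetical.

```python
class TaggedArray:
    """An array that stores a single taint tag for all of its elements."""

    def __init__(self, items):
        self.items = list(items)
        self.tag = False  # one tag shared by the whole array

    def store(self, index, value, tainted):
        self.items[index] = value
        # Writing one tainted element marks the entire array.
        self.tag = self.tag or tainted

    def load(self, index):
        # Every element read back inherits the array-wide tag.
        return self.items[index], self.tag


if __name__ == "__main__":
    arr = TaggedArray(["weather", "news"])
    arr.store(0, "phone number", tainted=True)
    value, tainted = arr.load(1)
    print(value, tainted)
```

Here the element at index 1 was never derived from sensitive data, but it is reported as tainted anyway. The saving is memory: one tag per array instead of one per element.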

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall 14% overhead) are on the high side for an already resource-constrained smartphone. This implementation choice was reasonable for a research project such as TaintDroid, but taint analysis is (hopefully) of high importance to the everyday user; TaintDroid could have aimed to go further than research.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider instead the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, and TaintDroid then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications, and it would allow TaintDroid to be used as a black box.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, along with the section where more information on each question can be found.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine Android uses to run user applications written in Java. Although Dalvik is a core part of the Android operating system, it is not part of the kernel: it runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point in the system at which to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the VM layer where TaintDroid is implemented. Instead, control structures are part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [11<nowiki>]</nowiki>].

COMP 3000 Essay 2 2010 Question 8

2010-12-02T03:29:40Z

Sliske: /* Implicit Flow */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the reader needs to understand the ideas it builds on. Two concepts in particular are central: 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is copied into 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leak occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on which path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure = "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Because the branch that is taken depends on the value of 'secure', anyone who can read 'notsecure' learns something about 'secure', even though it was never copied directly. Information leaks due to implicit flows are much harder to detect and protect against, precisely because of this indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint analysis performed at run time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically, by analyzing the program's source code rather than running it. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because it has no access to run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 

===Mathematical Model===
<big><code>Let V = {T,U} be the set of taint labels, where T is tainted and U is untainted: 
Using <big>⊕</big>: V x V -> V, with x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 

''Note: The paper is about dynamic taint analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid turns this around: it taints the variables that hold valuable private data and tracks where they go, so in this case the tainted variables are in fact the important private data. ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving through the network interface of the phone, then we know that some sensitive information has been disseminated.
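
The mark-and-follow idea can be illustrated with a minimal Python sketch. This is not TaintDroid's implementation: the <code>Tainted</code> class, the <code>read_location</code> source, the <code>network_send</code> sink, the coordinates and the URL are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    """A value paired with the set of sensitive sources it was derived from."""
    value: str
    sources: frozenset = frozenset()


def read_location() -> Tainted:
    # Hypothetical taint source: a made-up GPS reading, marked at its origin.
    return Tainted("45.38,-75.69", frozenset({"LOCATION"}))


def concat(a: Tainted, b: Tainted) -> Tainted:
    # Propagation: the result carries the marks of everything it was built from.
    return Tainted(a.value + b.value, a.sources | b.sources)


def network_send(payload: Tainted) -> list:
    # Hypothetical taint sink: report which marks are about to leave the device.
    return sorted(payload.sources)


# The location is combined with other data before it is sent,
# yet the mark survives the combination and is seen at the sink.
request = concat(Tainted("http://ads.example/?loc="), read_location())
leaked = network_send(request)
```

In this sketch the sink only reports the marks it sees; what to do with such a report (log it, warn the user, block the transmission) is a separate policy decision.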

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform consumes battery power. If the tracking system is to run in real time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead was measured on a "CPU-bound micro-benchmark", and that TaintDroid "imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system, the Dalvik VM, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, they added message-level tracking, which allows them to monitor inter-process (that is, inter-application) communication. They also modified the Java Native Interface (JNI) so that they can "patch the taint propagation on return" (Enck et al., YEAR, p3) and keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces, they provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3).
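
The figure of 32 taint markings per data unit suggests how compact a tag can be. The following is a minimal Python sketch, assuming each marking occupies one bit of a 32-bit integer; this encoding is an assumption made for illustration, and the marking names are hypothetical.

```python
# Hypothetical marking names; each one occupies a single bit of a 32-bit tag.
TAINT_CLEAR = 0
TAINT_LOCATION = 1 << 0
TAINT_PHONE_NUMBER = 1 << 1
TAINT_DEVICE_ID = 1 << 2

MASK_32 = 0xFFFFFFFF  # keep tags within 32 bits


def combine(tag_a: int, tag_b: int) -> int:
    # Data derived from two inputs carries the union of both sets of markings.
    return (tag_a | tag_b) & MASK_32


def has_marking(tag: int, marking: int) -> bool:
    # True if the given marking bit is set in the tag.
    return (tag & marking) != 0


# A value built from the location and the device ID carries both markings.
tag = combine(TAINT_LOCATION, TAINT_DEVICE_ID)
```

With this encoding, combining two tags is a single bitwise OR, which is one reason a fixed-width tag keeps per-operation overhead small.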

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as the Panorama Taint System [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama is able to perform OS-aware, whole-system taint analysis to detect and analyze how malicious code processes information. As one of the first dynamic taint analysis programs, Panorama counted real-time operation among its core features. However, because it used instruction-level analysis, it had a high overhead. Most taint analysis systems that use instruction-level methods make the system perform 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid avoids this problem by combining its four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, distinguishing between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, the higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, pre-compiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely, as the overheads (in this case a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the higher side for an already resource-constrained smartphone. This implementation choice was reasonable for the research project that TaintDroid is, but taint analysis is (hopefully) of high importance to the everyday user; TaintDroid could have aimed to go further than research.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, 'TaintCheck' is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the 'application binary', and TaintDroid then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications, and to be used as a black box.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [11<nowiki>]</nowiki>].

COMP 3000 Essay 2 2010 Question 8

2010-12-02T03:23:32Z

Sliske: /* Critique */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form its basis must first be understood. The following two concepts are central to understanding it: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can occur between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information Flow Theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
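
To see why implicit flows are hard to catch, consider a minimal Python sketch of a tracker that propagates taint only through explicit operations. The <code>Tracked</code> class is hypothetical and far simpler than any real tracker; it exists only to contrast the two kinds of flow.

```python
class Tracked:
    """A value with a taint flag that is propagated only by explicit operations."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def __add__(self, other):
        # Explicit flow: the result is tainted if either operand is tainted.
        return Tracked(self.value + other.value, self.tainted or other.tainted)


secure = Tracked(1, tainted=True)

# Explicit flow: the value is computed from 'secure', so the taint follows it.
copied = secure + Tracked(0)

# Implicit flow: 'notsecure' is never computed from 'secure'; it is assigned
# untainted constants, and only the branch taken depends on the secret.
if secure.value == 1:
    notsecure = Tracked(1)
else:
    notsecure = Tracked(0)
```

After the branch, <code>notsecure</code> holds the same value as <code>secure</code> but carries no taint, so a tracker that follows only explicit flows does not notice the leak.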

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the sources of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis because the static analysis does not have access to any additional run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 

===Mathematical Model===
<big><code>Let every variable carry a label in V = {T, U}, where T is tainted and U is untainted. 
Define <big>⊕</big>: V x V -> V for x, y ∈ V as: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It follows that whenever a tainted variable is used to compute another variable, that variable becomes tainted as well; the taint is propagated. In practice, this can be implemented by attaching a taint tag to each variable, which can then be tracked or inspected as needed. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 
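
The two-label model can be written out directly. The following Python sketch is only an illustration of the ⊕ operator defined above; the names <code>Label</code>, <code>join</code> and <code>label_of</code> are chosen here and do not come from the paper.

```python
from enum import Enum
from functools import reduce


class Label(Enum):
    U = "untainted"
    T = "tainted"


def join(x: Label, y: Label) -> Label:
    # x ⊕ y is T if either operand is T, and U only if both are U.
    return Label.T if Label.T in (x, y) else Label.U


def label_of(operands) -> Label:
    # The label of a computed value is the ⊕ of the labels of all its operands.
    return reduce(join, operands, Label.U)


# A value computed from two untainted inputs and one tainted input is tainted.
result = label_of([Label.U, Label.T, Label.U])
```

Because ⊕ is associative and commutative, the order in which operands are combined does not affect the resulting label.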

''Note: The paper relies on Dynamic Taint Analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid inverts this usage: the "tainted" variables are in fact important private data, which TaintDroid marks and then tracks as they move through the system.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving through the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform consumes battery power. If the tracking system is to run in real time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead was measured on a "CPU-bound micro-benchmark", and that TaintDroid "imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system, the Dalvik VM, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, they added message-level tracking, which allows them to monitor inter-process (that is, inter-application) communication. They also modified the Java Native Interface (JNI) so that they can "patch the taint propagation on return" (Enck et al., YEAR, p3) and keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces, they provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as the Panorama Taint System [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama is able to perform OS-aware, whole-system taint analysis to detect and analyze how malicious code processes information. As one of the first dynamic taint analysis programs, Panorama counted real-time operation among its core features. However, because it used instruction-level analysis, it had a high overhead. Most taint analysis systems that use instruction-level methods make the system perform 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid avoids this problem by combining its four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, distinguishing between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, the higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, pre-compiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the higher side for an already resource-constrained smartphone. This implementation choice was reasonable for the research project TaintDroid is, but taint analysis is (hopefully) of high importance to the everyday user; TaintDroid could have aimed to go further than research.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead as the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision an application to which you upload the application binary, and TaintDroid then returns a result stating whether the application is safe or not. This has the advantage of needing to run just once before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications. This would allow TaintDroid to be used as a black box.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004)

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [8<nowiki>]</nowiki>].

COMP 3000 Essay 2 2010 Question 8

2010-12-02T03:19:24Z

Sliske: /* Critique */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form its basis have to be understood. The following two concepts are central to understanding it: 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
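
The difference between the two kinds of flow can be sketched in Python. This is only an illustration: the <code>Tainted</code> wrapper and both copy functions are hypothetical helpers, not part of TaintDroid. The tracker below propagates taint on assignment only, so the copy made through branching ends up holding the secret without any taint mark.

```python
class Tainted:
    """A value paired with a taint flag (hypothetical helper for illustration)."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted


def copy_explicit(src):
    # Explicit flow: the value is assigned directly, so the taint follows it.
    return Tainted(src.value, src.tainted)


def copy_implicit(src, candidates):
    # Implicit flow: the secret only steers the branch. The result is built
    # from an untainted constant, so an assignment-only tracker sees no taint.
    for guess in candidates:
        if src.value == guess:
            return Tainted(guess, tainted=False)
    return Tainted(None, tainted=False)


secure = Tainted("1234", tainted=True)
explicit_copy = copy_explicit(secure)
implicit_copy = copy_implicit(secure, [f"{n:04d}" for n in range(10000)])

print(explicit_copy.value, explicit_copy.tainted)  # 1234 True
print(implicit_copy.value, implicit_copy.tainted)  # 1234 False
```

Catching the second copy would require tainting everything assigned inside a branch whose condition depends on tainted data, which is the extra bookkeeping that makes implicit flows expensive to track.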

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory and, when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]
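
The label, track and log cycle described above can be sketched as follows. This is a minimal sketch under stated assumptions: <code>TaintedStr</code>, <code>read_location</code>, <code>send_to_network</code> and the coordinate string are all made up for illustration, and the tainted source here is a piece of sensitive data rather than untrusted input.

```python
class TaintedStr(str):
    """A string that carries a taint mark (hypothetical helper)."""


def is_tainted(value):
    return isinstance(value, TaintedStr)


def concat(a, b):
    # Propagation rule: the result is tainted if either operand is tainted.
    joined = str(a) + str(b)
    return TaintedStr(joined) if is_tainted(a) or is_tainted(b) else joined


leak_log = []


def read_location():
    # Taint source: data is labelled at the point where it enters the program.
    return TaintedStr("45.0,-75.0")  # placeholder coordinates


def send_to_network(payload):
    # Taint sink: the run-time check that logs a leak when tainted data leaves.
    if is_tainted(payload):
        leak_log.append(str(payload))
    return len(payload)


send_to_network(concat("loc=", read_location()))
send_to_network("hello")
print(leak_log)  # ['loc=45.0,-75.0']
```

The extra check in every propagating operation and at every sink is where the run-time slowdown mentioned above comes from.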

===Static Taint Analysis===
Static taint analysis is a technique for computing an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because a static analysis does not have access to any run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 

===Mathematical Model===
<big><code>For all variables with taint value in V = {T, U}, where T is tainted and U is untainted: 
Using <big>⊕</big>: V x V -> V, with x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 
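
The ⊕ operator from the model above can be written out directly. The names <code>Taint</code>, <code>combine</code> and <code>taint_of</code> are chosen here for illustration only.

```python
from enum import Enum
from functools import reduce


class Taint(Enum):
    U = 0  # untainted
    T = 1  # tainted


def combine(x, y):
    # x (+) y = T if x = T or y = T; otherwise U.
    return Taint.T if Taint.T in (x, y) else Taint.U


def taint_of(*operands):
    # Taint of a value computed from several operands: fold (+) over them.
    return reduce(combine, operands, Taint.U)


print(combine(Taint.U, Taint.U))            # Taint.U
print(combine(Taint.T, Taint.U))            # Taint.T
print(taint_of(Taint.U, Taint.U, Taint.T))  # Taint.T
```

Because a single tainted operand is enough to taint the result, taint can only spread through a computation and is never removed by combining values.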

''Note: The paper talks about Dynamic Taint Analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid makes ingenious use of the idea by instead tainting variables that hold valuable data and tracking their progress; in this case the tainted variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, constrained hardware devices with minimal overhead. TaintDroid causes only roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., 2010, p1)
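
One natural way to store 32 taint markings per data unit, assumed here for illustration, is a 32-bit bit vector in which each bit stands for one kind of sensitive data. The bit assignments and names below are hypothetical and are not taken from the TaintDroid source.

```python
# Hypothetical bit assignments; each bit is one taint marking.
TAINT_CLEAR = 0
TAINT_LOCATION = 1 << 0
TAINT_PHONE_NUMBER = 1 << 1
TAINT_DEVICE_ID = 1 << 2
TAG_MASK = 0xFFFFFFFF  # keep the tag within 32 bits

NAMES = {
    TAINT_LOCATION: "location",
    TAINT_PHONE_NUMBER: "phone_number",
    TAINT_DEVICE_ID: "device_id",
}


def merge_tags(*tags):
    # Combining data combines markings: a bitwise OR of the operand tags.
    merged = TAINT_CLEAR
    for tag in tags:
        merged |= tag
    return merged & TAG_MASK


def describe(tag):
    # List the markings present in a tag.
    return [name for bit, name in NAMES.items() if tag & bit]


combined = merge_tags(TAINT_LOCATION, TAINT_DEVICE_ID)
print(describe(combined))  # ['location', 'device_id']
```

With this encoding, propagating taint costs a single OR per operation and one extra word of storage per tracked unit, which is consistent with the small overheads reported above.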

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, they modified the Java Native Interface (JNI) to provide message-level tracking, which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return" (Enck et al., 2010, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces, they are able to provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings" (Enck et al., 2010, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama Taint System [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama is able to perform OS-aware, whole-system taint analysis to detect and analyze malicious code's information processing behavior. As Panorama was one of the first dynamic taint analysis programs, one of its core features is its real-time ability. However, Panorama used instruction-level analysis, and so had a high overhead. Most taint analysis systems using instruction-level methods cause the system to perform 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining its four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the higher side for an already resource-constrained smartphone. This implementation choice was reasonable for the research project TaintDroid is, but taint analysis is (hopefully) of high importance to the everyday user, and TaintDroid could have aimed to go further than research.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead as the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision an application to which you upload the application binary, and TaintDroid then returns a result stating whether the application is safe or not. This has the advantage of needing to run just once before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004)

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [8<nowiki>]</nowiki>].

COMP 3000 Essay 2 2010 Question 8

2010-12-02T03:11:45Z

Sliske: /* Background Concepts */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form its basis have to be understood. The following two concepts are central to understanding it: 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure == "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Since we can determine the value held in 'secure' by testing it in logic statements and observing which branch was taken, we can access the secure information indirectly. Information leaks due to implicit flows are much harder to detect and protect against, because no secret value is ever copied directly.
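The difference between the two kinds of flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the <code>Labeled</code> class and both helper functions are hypothetical names invented here, and the "tracker" is simply a flag that follows direct assignments.

```python
# Hypothetical illustration: a value paired with a "tainted" flag.
class Labeled:
    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted


def explicit_copy(src):
    # Explicit flow: the value is assigned directly, so the label travels with it.
    return Labeled(src.value, src.tainted)


def implicit_copy(src, guess):
    # Implicit flow: only a branch reads the secret. The result is built from
    # constants, so a tracker that follows assignments alone leaves it unlabeled.
    if src.value == guess:
        return Labeled(1)
    return Labeled(0)


secure = Labeled("blah blah", tainted=True)
leaked_explicitly = explicit_copy(secure)
leaked_implicitly = implicit_copy(secure, "blah blah")

print(leaked_explicitly.tainted)                          # label preserved
print(leaked_implicitly.value, leaked_implicitly.tainted)  # secret revealed, label lost
```

In the second case the result tells an observer whether the guess was right, yet it carries no label, which is why implicit flows are so hard to catch by following assignments alone.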

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]
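As a rough sketch of the source, propagation and sink steps described above, the following Python keeps a "shadow" set of tainted variable names next to the program being observed. The class and method names (<code>TaintTracker</code>, <code>source</code>, <code>assign</code>, <code>sink</code>) are assumptions made for this illustration and are not taken from any of the cited systems.

```python
class TaintTracker:
    """Toy run-time tracker: shadow state records which variables are tainted."""

    def __init__(self):
        self.tainted = set()  # names of variables currently carrying taint
        self.log = []         # (variable, sink) pairs recorded at dangerous uses

    def source(self, name):
        # Data arriving from an untrusted source is labeled as tainted.
        self.tainted.add(name)

    def assign(self, dst, *srcs):
        # dst = f(srcs): dst is tainted if and only if some operand is tainted.
        if any(s in self.tainted for s in srcs):
            self.tainted.add(dst)
        else:
            self.tainted.discard(dst)

    def sink(self, name, where):
        # A potentially dangerous use: log it when the data is tainted.
        if name in self.tainted:
            self.log.append((name, where))
            return True
        return False


tracker = TaintTracker()
tracker.source("user_input")
tracker.assign("query", "prefix", "user_input")  # built from tainted data
tracker.assign("banner", "prefix")               # built from clean data only
tracker.sink("query", "database.execute")
tracker.sink("banner", "screen.print")
print(tracker.log)
```

Every <code>assign</code> and <code>sink</code> call stands for an extra check performed while the program runs, which is where the slowdown mentioned above comes from.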

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because the static analysis does not have access to any run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of labels, where T is tainted and U is untainted. 
Define <big>⊕</big>: V × V → V. For x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used to compute another variable, the resulting variable becomes tainted as well; the taint is propagated. Taking this further, we can mark variables as tainted by attaching a taint tag to them, which can then be tracked or used as needed. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 
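The combination rule can be written out directly. The sketch below is a minimal Python rendering of it; the names <code>combine</code> and <code>taint_of</code> are chosen here purely for illustration.

```python
from functools import reduce

T, U = "T", "U"  # tainted and untainted labels


def combine(x, y):
    # The result is tainted if either operand is tainted, untainted otherwise.
    return T if T in (x, y) else U


def taint_of(*operands):
    # Label of a value computed from any number of operands.
    # With no operands (a constant) the result is untainted.
    return reduce(combine, operands, U)


for x in (T, U):
    for y in (T, U):
        print(x, y, combine(x, y))
```

Because the rule is associative and commutative, the label of a long expression does not depend on the order in which its operands are combined.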

''Note: The paper talks about Dynamic Taint Analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid instead makes ingenious use of "taint" to mark variables that hold valuable data and to track their progress; in this case the "tainted" variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if marked data is ever seen leaving the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)
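One natural way to hold 32 markings per data unit is a 32-bit vector with one bit per marking, so that combining two tags is a single bitwise OR. The Python sketch below illustrates that encoding; the marking names and bit positions are assumptions made for this example and are not TaintDroid's actual constants.

```python
# Hypothetical marking names; each occupies one bit of a 32-bit tag.
TAINT_CLEAR = 0
TAINT_LOCATION = 1 << 0
TAINT_CONTACTS = 1 << 1
TAINT_IMEI = 1 << 2
MASK_32 = 0xFFFFFFFF

MARKINGS = {
    "location": TAINT_LOCATION,
    "contacts": TAINT_CONTACTS,
    "imei": TAINT_IMEI,
}


def merge(*tags):
    # Union of markings: a value derived from several inputs carries all of them.
    result = TAINT_CLEAR
    for tag in tags:
        result |= tag
    return result & MASK_32


def has_marking(tag, marking):
    return (tag & marking) != 0


def markings_in(tag):
    # Names of every marking present, e.g. for a report at a network sink.
    return [name for name, bit in MARKINGS.items() if has_marking(tag, bit)]


payload_tag = merge(TAINT_LOCATION, TAINT_IMEI)
print(markings_in(payload_tag))
```

A fixed-width tag keeps both the storage cost and the cost of each propagation step constant, which fits the low overhead figures quoted above.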

This low overhead is achieved by modifying the code directly at the virtual machine layer of the Android system (the Dalvik VM interpreter) to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, message-level tracking allows them to monitor inter-process, a.k.a. inter-application, communications. For native code reached through the Java Native Interface (JNI), method-level tracking allows them to "patch the taint propagation on return" (Enck et al., YEAR, p3), so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces, they are able to provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).
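To make "conservatively retains its taint markings" concrete, the Python sketch below models file-level tracking under one simplifying assumption: a file carries a single tag, and that tag is the union of everything ever written to it. The <code>TaggedFile</code> class and the <code>TAINT_LOCATION</code> constant are hypothetical names used only for this illustration.

```python
TAINT_LOCATION = 1 << 0  # hypothetical marking bit


class TaggedFile:
    """Toy file that keeps one taint tag for its whole contents."""

    def __init__(self):
        self.chunks = []
        self.tag = 0  # starts with no markings

    def write(self, data, tag):
        # Coarse granularity: the file tag accumulates the tag of every write.
        self.chunks.append(data)
        self.tag |= tag

    def read_all(self):
        # Everything read back carries the whole file's tag, including bytes
        # that were clean when written. This is safe but can over-report.
        return "".join(self.chunks), self.tag


f = TaggedFile()
f.write("hello ", 0)
f.write("45.4,-75.7", TAINT_LOCATION)
data, tag = f.read_all()
print(data, tag)
```

The coarser the unit that shares a tag, the cheaper the bookkeeping and the more likely it is that clean data gets flagged, which is the granularity trade-off between accuracy and performance.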

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence both its performance and its accuracy. Existing taint tracking approaches, like the Panorama Taint System [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama is able to perform OS-aware, whole-system taint analysis to detect and analyze malicious code's information processing behavior. Panorama was one of the earlier dynamic taint analysis systems, and real-time operation was one of its core features. However, Panorama used instruction-level analysis, and so had a high overhead. Most taint analysis systems that use instruction-level methods cause the system to run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>]

Moreover, instruction-level tracking faces a serious problem: taint explosion. With some complex instructions, such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining its four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, distinguishing between different data pointers at different levels to ensure accuracy.

By combining these four levels of taint tracking (variable, method, message and file), TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 37 were clearly legitimate uses.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] The study also found that 50% of the applications sent the user's location to advertising servers, and that seven of the applications transmitted the user's device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer-grained visibility into how applications use private data is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

The challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to get around them, treating the targeted sensitive information as a taint source and using a taint marking to identify the information type. The research was clearly effective, given the number of information leaks that were found in a small sample of applications. However, while the implementation is strong in that the overhead is low and the accuracy is high, trade-offs were made to keep the overhead that low.

To prevent additional overhead, TaintDroid does not track implicit data flow, that is, flow through control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android VM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: because most string objects have the same tag, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely because the overheads (in this case a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone. Essentially, one can argue that TaintDroid could have been built with a better set of design choices.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead whenever the user uses the phone. Compare the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which the user uploads the application binary and from which TaintDroid returns a verdict on whether the application is safe or not. This has the advantage that the analysis needs to run just once, before installation, and hence the overheads are much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, M.-L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, along with the section in which to find more information pertaining to each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine used by Android to run user applications written in Java. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the VM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [11<nowiki>]</nowiki>].

Talk:COMP 3000 Essay 2 2010 Question 8

2010-12-02T03:11:17Z

Sliske: /* Work Plan */

Group Members

Trevor Bonesaw Malone - tmalone@connect.carleton.ca //FIRST POST!

Qi Zhang - qzhang13@connect.carleton.ca

Gregory Bint - gbint@connect.carleton.ca

Gautam Akiwate - gakiwate@connect.carleton.ca

Corey Ling - cling@connect.carleton.ca

Sarah Liske

== Work Plan ==

As Trevor intimated, we should have clear division of work going forward. This is sort of the break down as I see it. Please edit as you think of new ideas!

* Background Concepts
** Information Flow Theory. (Implicit and Explicit Flows.) --Done[--[[User:Gautam|Gautam]] 03:54, 28 November 2010 (UTC)]
** What is dynamic taint analysis --Done[--[[User:Gautam|Gautam]] 05:07, 28 November 2010 (UTC)]
** What is the difference between dynamic and static analysis --Done[--[[User:Gautam|Gautam]] 03:54, 30 November 2010 (UTC)]]
* Research Problem
** How do we build a DTA engine for a phone? - done, but by who?
** Why do we want to? (information misuse) - done, but by who?
* Contribution
** How did they implement their DTA engine (Done: --[[User:Cling|Cling]] 04:50, 26 November 2010 (UTC))
** What did they find about information misuse (Done: --[[User:Cling|Cling]] 04:50, 26 November 2010 (UTC))
** Compared to the existing taint tracking approaches. [[User:Zhangqi|Zhangqi]] 07:11, 27 November 2010 (UTC)
** (What else should be in the contributions? Anything need fleshing out?) (Working on that now :) ) sliske
* Critique
**Added two paragraphs at the end of the present critique. Please incorporate it into your content as you deem fit.--[[User:Gautam|Gautam]] 09:07, 30 November 2010 (UTC)
**^ done. fleshed out critique, and added a bit about how taintdroid doesn't track implicit flow. Also reworded (the entire essay) for clarity where necessary/checked spelling. It would be a good idea for everyone to read it over once for spelling/clarity before thursday, just in case something doesn't make sense - sliske
* References
** The article has 61 references! We can probably use some of them
**whee! reading papers and sticking in information as need be. Also working out how to cite properly, as there are two citations used currently
**references added and citations taken care of. will go over fill in a few places where information may be lacking after class sliske
**Referencing is a little askew. The numbers don't match the papers as listed in the referencing. Also the papers are usually cited with a number and enclosed in "[]"
**thanks for giving the paper a read over/noticing that :)

List of information we need to find external sources for:
* History of taint analysis
* History of privacy research relating to smart phones

== Work In Progress ==

Log what you are working on *right now* so that other people don't try to do the same thing. Make sure to clear your name from here when you are done.

* Gregory Bint: Research Problem
** Need to find some history on smart phone security research for the second part.

* Gautam Akiwate: Background Concepts
** Any resources on Dynamic taint Analysis would be appreciated!

* Qi Zhang, Corey Ling: Contributions

* Trevor Malone: Critique

* Sarah Liske: References and Questions, Clarity/Spelling.

== Some Notes from the Video ==

Tracking of privacy-sensitive data through Dynamic Taint Analysis (aka Taint Tracking). The trick is to mark private data as it is sourced, and then follow those marks until (unless) they leave the phone.

Android phones run Java apps, which are compiled into DEX, and then run on top of the Dalvik VM. It is this VM that we modify so that we can support the storage and tracking of taint tags.

Taint sources
* Low-bandwidth sensors
** Location
** Accelerometer
* High-bandwidth sensors
** Mic
** Camera
* Information DB
** Address book
** SMS storage
* Device ID
** IMEI
** IMSI (don't actually track this one because of false positives)
** ICC_ID
** Phone Number

Taint sink (where marked data can leave the phone)
* Network Taint Sink

Taint propagation
* ???

Taint tags are stored in memory interleaved with the variables they are tracking

Some standard Data Flow technique is used to propagate these tags, especially as one variable that is marked may be assigned to another, so now that variable needs to be tracked as well.

Tracks explicit flows of data, not implicit
To fully capture implicit flows, you need to do static analysis, which is hard with closed-source apps, and cannot be done real-time

Implicit flows are not tracked
* Implicit flows can involve "taint-scope", tracking based on conditionals in code

=== Performance ===

The goal is to create a real time tracking system, so the TaintDroid's performance impact is of some importance

14% CPU overhead
4.4% memory overhead

Macro benchmarks (to get a feel for what the phone's usability is like with TD running)
* App load: 3% (2ms)

=== Findings ===

20 out of 30 tested applications share data in a way that is not expected.

67 of 105 flagged pieces of data leaving the device had no obviously legitimate purpose (verified by the authors).

Many apps sent location data and other unique identifiers to advertising servers.

Most apps do not mention anything to the user.

=== Limitations ===

Tracks only explicit data flows.

An application *could* launder the tags off of the data, if they really wanted to hide this sort of thing from TaintDroid.

There are methods that could be used to protect against this, but they go against the goal of a light-weight, real-time tracking system. TD is not necessarily about catching truly malicious programs, but rather just those that leak information.

Why do apps take this information?
* Lazy; in the demo video, the wallpaper app seems to use the IMEI just as a ready made unique ID
* Overzealous; the developer might think they *need* the data for something, but actually do not
* Ads; advertisers do seem a little presumptuous in their data collection
* Spying; bosses or spouses
* Malicious;

=== QA Period ===

Q: how do we prevent a malicious app from removing a taint attribute on a file

A: TD operates at too low a level for this to be a problem; TD assumes that the native code is trusted

Q: It seems like you had a lot of false positives

A: The point of this tool was to identify privacy sensitive information as having left the phone, not whether or not a privacy violation has taken place.

Q: Now that TD is released; couldn't malicious apps use some of the methods described in the paper to get around it?

A: Well, yes, but it is not just about maliciousness; it could just be laziness or over-zealous ad stuff.

==Other Information==

Hey guys, thought I would just post a generalized paragraph about our essay.

In today’s society, smartphones are the new big thing. To me that’s what makes this paper so interesting. This paper focuses on private information in Android phones and the misuse of this information. The information at risk includes the SIM card serial number, the ID of the device, and the phone number. TaintDroid runs on smartphones and provides an efficient taint tracking and analysis system. It has the ability to track sensitive data from multiple sources and to examine the misuse of such data. In their study, out of 80 popular third-party applications, TaintDroid monitored that 68 applications had potential misuse of user’s private data. This tool is great for knowing which applications are safe and which are not, so your private data can remain private.

Also, we should really think of splitting up the work in some way. If some people have specific sections they would like to do lets figure that out now so we can divide the workload and get it done over the next couple of days. I don't personally care what part I'm going to have to do, so lets get this going. Any other information people wanna post feel free the more the better, even if we don't end up using it.

[[user:Tmalone|Trevor Malone]]

Hey guys! Anything else we need to get done? Let me know and I can help in anyway possible.

[[user:Tmalone|Trevor Malone]]

==Relevant Sources==
*NEWSOME,J.,AND SONG,D.Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection]
Seems to be THE Dynamic Taint Analysis Paper.Talks about implementation on TaintCheck. Could be also useful for critique section -[Gautam]

COMP 3000 Essay 2 2010 Question 8

2010-12-02T03:10:30Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the reader first needs to understand the ideas on which it is built. Two concepts are central: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can take place between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information Flow Theory tries to capture this flow of information in a mathematical model. 
In a security model, information flow falls into two categories: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is copied into 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow occurs when information subject to a security classification is deduced indirectly. Here the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure == "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Since we can determine the value held in 'secure' by testing it in logic statements and observing which branch was taken, we can access the secure information indirectly. Information leaks due to implicit flows are much harder to detect and protect against, because no secret value is ever copied directly.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]
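
The label, propagate, and check cycle described above might look roughly like the following Python sketch. The <code>Tainted</code> wrapper and the <code>source</code> and <code>sink</code> functions are assumptions made for illustration; they are not taken from any of the cited systems.

```python
class Tainted:
    """A value carrying a taint flag (hypothetical wrapper)."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def __add__(self, other):
        # Propagation: the result is tainted if either operand is tainted.
        if isinstance(other, Tainted):
            return Tainted(self.value + other.value,
                           self.tainted or other.tainted)
        return Tainted(self.value + other, self.tainted)


leaks = []  # log of tainted values that reached the sink


def source(raw):
    """Mark data arriving from an untrusted source as tainted."""
    return Tainted(raw, tainted=True)


def sink(data):
    """Check data at a sensitive operation and log it if it is tainted."""
    if data.tainted:
        leaks.append(data.value)
    return data.value


user_input = source("payload")
command = Tainted("echo ") + user_input  # taint survives the concatenation
sink(command)                            # logged
sink(Tainted("echo hello"))              # untainted, not logged
```

The extra work on every operation and at every sink is what makes the monitored program slower than the original.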

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis because it does not have access to any run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 
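
As a rough illustration of the over-approximation, the following Python sketch treats a program as a flat list of assignments gathered from every branch and iterates until no new variable becomes tainted. The program representation and the function name are assumptions chosen for brevity; real static analyzers work on much richer program models.

```python
def tainted_variables(assignments, sources):
    """Return every variable that may be influenced by a source.

    `assignments` is a list of (target, operands) pairs collected from all
    branches of the program, ignoring which branch actually runs.
    """
    tainted = set(sources)
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for target, operands in assignments:
            if target not in tainted and tainted.intersection(operands):
                tainted.add(target)
                changed = True
    return tainted


# Hypothetical program:
#   a = user_input
#   if flag: b = a
#   else:    b = 0
#   c = b + 1
#   d = 5
program = [
    ("a", ["user_input"]),
    ("b", ["a"]),
    ("b", []),
    ("c", ["b"]),
    ("d", []),
]
result = tainted_variables(program, {"user_input"})
```

Because both branches are considered, <code>b</code> and <code>c</code> are reported as tainted even for runs in which the <code>else</code> branch executes; this is the loss of accuracy relative to a dynamic analysis.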

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint states, where T is tainted and U is untainted. 
Define <big>⊕</big>: V x V -> V for x, y ∈ V as: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 
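
One way to carry such a tag in practice is as a bit vector, with one bit per kind of taint, so that ⊕ becomes a bitwise OR. The sketch below assumes a 32-bit tag; the marking names and bit positions are hypothetical and chosen only for illustration.

```python
# Hypothetical marking names; each taint marking gets one bit of a 32-bit tag.
TAINT_CLEAR = 0x0
TAINT_LOCATION = 0x1
TAINT_PHONE_NUMBER = 0x2
TAINT_DEVICE_ID = 0x4


def join(x: int, y: int) -> int:
    """Combine two taint tags; the result carries every marking of either."""
    return (x | y) & 0xFFFFFFFF


def is_tainted(tag: int) -> bool:
    """A tag is tainted (T) if any bit is set, untainted (U) otherwise."""
    return tag != TAINT_CLEAR


combined = join(TAINT_LOCATION, TAINT_DEVICE_ID)
```

With a single bit this reduces to the two-valued model above: the result is untainted only when both inputs are untainted.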

''Note: The paper deals with Dynamic Taint Analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid instead makes ingenious use of "taint" to mark variables that hold valuable data and tracks their progress; in this case the tainted variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if marked data ever leaves through the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to run in real time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead was measured on a "CPU-bound micro-benchmark", and that TaintDroid "imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system, the Dalvik VM, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, they modified the Java Native Interface (JNI) to provide message-level tracking, which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return" (Enck et al., YEAR, p3), so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces, they are able to provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3).
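
The trade-off behind the coarser tracking levels can be sketched as follows. The sketch assumes that a message (or file) carries a single tag formed as the union of the tags of everything written into it; this is a simplification for illustration, with hypothetical names, and the paper should be consulted for the exact behavior.

```python
from functools import reduce

LOCATION = 0x1  # hypothetical marking bit


def message_tag(field_tags):
    """Message-level tracking: one tag for the whole message, formed as the
    union of the tags of the values written into it."""
    return reduce(lambda a, b: a | b, field_tags, 0)


# A message holding a tainted location and two untainted fields.
tags_in = [LOCATION, 0, 0]
tag = message_tag(tags_in)

# The receiver reads every field back with the message's tag, so untainted
# fields now appear tainted: cheaper to store, but a possible false positive.
tags_out = [tag for _ in tags_in]
```

Storing one tag per message or file keeps the overhead low, while per-variable tags inside the VM keep the tracking precise where it matters most.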

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as the Panorama Taint System [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama performs OS-aware, whole-system taint analysis to detect and analyze how malicious code processes information. Panorama was one of the earlier dynamic taint analysis systems, and real-time operation was one of its core features. However, because Panorama works at the instruction level, it has a high overhead. Most taint analysis systems that use instruction-level methods run 2 to 20 times slower than normal, which makes them unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, distinguishing between data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were clearly legitimate, leaving 68 instances of potential misuse.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] It also determined that 50% of the applications sent the user's location to advertising servers, and that 7 of the applications transmitted the user's device ID and, in some cases, the phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage, which arise because most string objects have the same tag. Because of these shared tags, false positives can occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the higher side for an already resource-constrained smartphone. Essentially, this is to argue that TaintDroid could have been built with a better set of design choices.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All that TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck is a mechanism that performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, and which then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, along with the section in which to find more information pertaining to each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [11<nowiki>]</nowiki>].

COMP 3000 Essay 2 2010 Question 8

2010-12-02T03:07:57Z

Sliske: /* Contribution */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the ideas which form the basis of this theory have to be understood. All in all, the following two concepts can be said to be central to understanding this paper. 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] Depending on the path the program takes, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf [2<nowiki>]</nowiki>][http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec [4<nowiki>]</nowiki>]

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf [5<nowiki>]</nowiki>] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis because it does not have access to any run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint states, where T is tainted and U is untainted. 
Define <big>⊕</big>: V x V -> V for x, y ∈ V as: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf [1<nowiki>]</nowiki>] 

''Note: The paper deals with Dynamic Taint Analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid instead makes ingenious use of "taint" to mark variables that hold valuable data and tracks their progress; in this case the tainted variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm [6<nowiki>]</nowiki>]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf [7<nowiki>]</nowiki>] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if marked data ever leaves through the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to run in real time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [8<nowiki>]</nowiki>]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf [3<nowiki>]</nowiki>]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf [10<nowiki>]</nowiki>]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead was measured on a "CPU-bound micro-benchmark", and that TaintDroid "imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system, the Dalvik VM, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] Next, they modified the Java Native Interface (JNI) to provide message-level tracking, which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return" (Enck et al., YEAR, p3), so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces, they are able to provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ [4<nowiki>]</nowiki>], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as the Panorama Taint System [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama performs OS-aware, whole-system taint analysis to detect and analyze how malicious code processes information. Panorama was one of the earlier dynamic taint analysis systems, and real-time operation was one of its core features. However, because Panorama works at the instruction level, it has a high overhead. Most taint analysis systems that use instruction-level methods run 2 to 20 times slower than normal, which makes them unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html [4<nowiki>]</nowiki>][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf [11<nowiki>]</nowiki>] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, distinguishing between data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf [9<nowiki>]</nowiki>] It also determined that 50% of the applications sent the user's location to advertising servers and that 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction because it provides a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To avoid additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at the kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, whereas a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: most string objects share the same tag, and because of these shared tags, false positives can occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the higher side for an already resource-constrained smartphone. Essentially, this suggests that TaintDroid could have been built with a better set of design choices.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf [3<nowiki>]</nowiki>] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which the user uploads the application binary, after which TaintDroid reports whether the application is safe. This has the advantage of needing to run only once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004)

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf [11<nowiki>]</nowiki>].

COMP 3000 Essay 2 2010 Question 8

2010-12-02T01:03:03Z

Sliske: /* Contribution */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts on which it is based have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of the indirect nature of implicit flows.
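
The difference between the two kinds of flow can be sketched in Python. This is only an illustration: the <code>Tainted</code> class and both copy functions are hypothetical helpers invented for this example and are not taken from the paper. A naive tracker that propagates taint only through assignment catches the explicit copy but misses the implicit one.

```python
class Tainted:
    """A value paired with a boolean taint flag (hypothetical helper)."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted


def copy_explicit(src):
    # Direct assignment: the taint flag travels with the data.
    return Tainted(src.value, src.tainted)


def copy_implicit(src, candidates):
    # The secret is never assigned anywhere; it only steers the branch.
    # The result is built from an untainted constant, so a tracker that
    # follows assignments alone sees nothing suspicious.
    for guess in candidates:
        if src.value == guess:
            return Tainted(guess, tainted=False)
    return Tainted(None, tainted=False)


secure = Tainted("1234", tainted=True)
explicit_copy = copy_explicit(secure)
implicit_copy = copy_implicit(secure, ["0000", "1234", "9999"])
```

Here <code>implicit_copy</code> holds the same value as <code>secure</code> but carries no taint, which is the kind of leak that assignment-based tracking cannot see.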

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique for computing an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because the static analysis does not have access to any additional run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>For all variables V = {T,U} ;T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T; x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
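
As a small worked example of this model, the operator ⊕ can be written as a Python function and applied to a sequence of assignments. The names <code>combine</code> and <code>propagate</code>, and the variables a to e, are made up for this sketch.

```python
T, U = "T", "U"  # tainted / untainted labels from the model above


def combine(x, y):
    """x (+) y: the result is tainted if either operand is tainted."""
    return T if T in (x, y) else U


def propagate(assignments, labels):
    """Apply assignments of the form (target, [operands]) in order.

    Each target receives the combination of its operands' labels.
    """
    for target, operands in assignments:
        result = U
        for name in operands:
            result = combine(result, labels[name])
        labels[target] = result
    return labels


# c = a op b; d = c; e = b, starting with a tainted and b untainted
labels = propagate(
    [("c", ["a", "b"]), ("d", ["c"]), ("e", ["b"])],
    {"a": T, "b": U},
)
```

After the run, c and d are tainted because they depend (directly or transitively) on a, while e stays untainted because it depends only on b.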

''Note: The paper discusses dynamic taint analysis. TaintDroid makes ingenious use of "taint" by tainting variables that are of value and tracking their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the tainted variables are in fact important private data. ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the network interface of the phone, then we know that some sensitive information has been disseminated.
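
The source-to-sink idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the <code>Tag</code> values, the <code>Data</code> wrapper, and the <code>read_location</code> and <code>send_to_network</code> functions are hypothetical and do not correspond to TaintDroid's actual code or to any Android API.

```python
from enum import IntFlag


class Tag(IntFlag):
    """Hypothetical taint markings, one bit per information type."""
    NONE = 0
    LOCATION = 1
    PHONE_NUMBER = 2


class Data:
    """A string value that carries a set of taint markings."""

    def __init__(self, value, tag=Tag.NONE):
        self.value = value
        self.tag = tag

    def __add__(self, other):
        # Derived data carries the union of its operands' markings.
        if isinstance(other, Data):
            return Data(self.value + other.value, self.tag | other.tag)
        return Data(self.value + other, self.tag)


def read_location():
    # Hypothetical taint source: data is marked where it enters the system.
    return Data("45.38,-75.69", Tag.LOCATION)


def send_to_network(data, log):
    # Hypothetical taint sink: marked data leaving the device is logged.
    if data.tag != Tag.NONE:
        log.append(data.tag)
    return data.value


log = []
payload = Data("loc=") + read_location()
send_to_network(payload, log)        # logged: carries LOCATION
send_to_network(Data("hello"), log)  # not logged: no markings
```

The mark survives the string concatenation, so the sink can report which type of information left the device even though the application transformed the data first.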

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes only roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).
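
The trade-off behind the coarser tracking levels can be illustrated with a short Python sketch. It assumes, for the purpose of the example, that a message carries a single tag equal to the union of the tags of the values packed into it; the function names and the <code>LOCATION</code> bit are hypothetical.

```python
LOCATION = 0x1  # hypothetical taint marking bit


def pack_message(fields):
    """Pack {name: (value, tag_bits)} into (payload, message_tag).

    Only one tag is kept for the whole message, which is cheap to store
    and to transmit between processes.
    """
    message_tag = 0
    payload = {}
    for name, (value, tag) in fields.items():
        payload[name] = value
        message_tag |= tag
    return payload, message_tag


def unpack_field(message, name):
    """Every field read from a message inherits the message-wide tag."""
    payload, message_tag = message
    return payload[name], message_tag


msg = pack_message({
    "greeting": ("hi", 0),
    "where": ("45.38,-75.69", LOCATION),
})
greeting, greeting_tag = unpack_field(msg, "greeting")
```

In this sketch the harmless <code>greeting</code> field comes out tainted because it shared a message with location data. Coarse granularity is conservative: it may over-taint, but it does not lose a marking.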

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], the virtualized architecture of Android allows TaintDroid to use four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence both its performance and its accuracy. Existing taint tracking approaches, such as the Panorama Taint System [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11], rely on instruction-level dynamic taint analysis using whole-system emulation. Panorama is able to perform OS-aware, whole-system taint analysis to detect and analyze how malicious code processes information. A core feature of this kind of taint analysis is that it runs in real time. However, this method causes the system to perform 2 to 20 times slower than normal, which makes it unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11] Moreover, instruction-level tracking faces a serious problem: taint explosion. When complex instructions such as CMPXCHG or REP MOV are used, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid avoids this problem by combining its four levels of tracking. For example, variable-level tracking lets TaintDroid provide flow semantics for taint propagation and distinguish between data pointers at different levels, which helps ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications sent the user's location to advertising servers and that 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction because it provides a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To avoid additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at the kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, whereas a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: most string objects share the same tag, and because of these shared tags, false positives can occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which the user uploads the application binary, after which TaintDroid reports whether the application is safe. This has the advantage of needing to run only once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf 11].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004)

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:59:28Z

Sliske: /* Mathematical Model */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the reader first needs to understand the ideas on which it is built. The following two concepts are central to understanding it: 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
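
The same explicit flow can be sketched in Python. This is only an illustration: the two-level lattice (<code>PUBLIC</code>, <code>PRIVATE</code>), the <code>Labeled</code> class and the <code>assign</code> helper are hypothetical names invented here, not something taken from the cited work.

```python
from dataclasses import dataclass

# Hypothetical two-level security lattice: PUBLIC is lower than PRIVATE.
PUBLIC, PRIVATE = 0, 1


@dataclass
class Labeled:
    """A value together with the security level it was declared at."""
    value: object
    level: int


def assign(target_level: int, source: Labeled) -> Labeled:
    """Copy source into a variable declared at target_level.

    The copy is rejected when it would move data to a lower level,
    which is exactly the explicit flow described above.
    """
    if source.level > target_level:
        raise PermissionError("explicit flow from a higher to a lower level")
    return Labeled(source.value, target_level)


secure = Labeled("555-0100", PRIVATE)

# PRIVATE -> PRIVATE is allowed.
ok = assign(PRIVATE, secure)

# PRIVATE -> PUBLIC is the leak; the check stops it.
try:
    notsecure = assign(PUBLIC, secure)
    leaked = True
except PermissionError:
    leaked = False
```

Checking at the point of assignment is enough here because the leak happens in a single, visible copy operation.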
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against because of their indirect nature.
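
A small Python sketch can show why implicit flows are hard to catch. The <code>Tainted</code> wrapper below is a hypothetical, deliberately naive tracker that only propagates taint through explicit operations; it is an assumption made for illustration and not how any particular system is implemented.

```python
class Tainted:
    """A value paired with a taint flag that follows explicit operations only."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def __add__(self, other):
        # The result is tainted if either operand is tainted.
        if isinstance(other, Tainted):
            return Tainted(self.value + other.value, self.tainted or other.tainted)
        return Tainted(self.value + other, self.tainted)


secret = Tainted(7, tainted=True)

# Explicit flow: the taint follows the data into the result.
explicit_copy = secret + 0

# Implicit flow: the secret only steers a branch, and the value that is
# stored is an ordinary loop counter that carries no taint at all.
recovered = 0
for guess in range(16):
    if secret.value == guess:
        recovered = guess
```

After the loop, <code>recovered</code> holds the secret value as a plain, untainted integer, even though no tainted value was ever copied into it.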

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique that computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because the static analysis does not have access to any run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint values, where T is tainted and U is untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
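
The operator ⊕ defined above can be written out directly. In this sketch the names <code>join</code> and <code>taint_of</code> are hypothetical; the behaviour follows the two rules of the model.

```python
from functools import reduce

# The two taint values of the model: T (tainted) and U (untainted).
T, U = "T", "U"


def join(x, y):
    """x (+) y: tainted if either operand is tainted, untainted otherwise."""
    return T if T in (x, y) else U


def taint_of(*operands):
    """Taint of a result computed from any number of operands.

    Folding with join means one tainted operand taints the whole result;
    with no operands the result is untainted.
    """
    return reduce(join, operands, U)
```

For example, <code>taint_of(U, U, T)</code> evaluates to <code>T</code>, while <code>taint_of(U, U)</code> stays <code>U</code>.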

''Note: The paper talks about Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" to mark variables that are of value and track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if marked data is ever seen leaving the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)
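
One compact way to hold 32 independent taint markings per data unit is a 32-bit vector with one bit per marking. The sketch below assumes that encoding; the marking names, their bit positions and the helper functions are hypothetical and chosen only for illustration.

```python
# Hypothetical marking names and bit positions (one bit per marking).
LOCATION = 1 << 0
PHONE_NUMBER = 1 << 1
DEVICE_ID = 1 << 2

MASK_32 = 0xFFFFFFFF  # keep tags within 32 bits

NAMES = {LOCATION: "location", PHONE_NUMBER: "phone_number", DEVICE_ID: "device_id"}


def combine(tag_a, tag_b):
    """Tag of a value derived from two inputs: the union of their markings."""
    return (tag_a | tag_b) & MASK_32


def markings(tag):
    """List the marking names that are set in a tag."""
    return [name for bit, name in NAMES.items() if tag & bit]


# A value built from the location and the device ID carries both markings.
tag = combine(LOCATION, DEVICE_ID)
```

With this layout, merging the markings of two values costs a single bitwise OR, which fits the low overhead figures quoted above.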

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers highly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11], rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid, however, addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, the higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This partially is because the applications being tested are loaded onto the phone as black-box, precompiled binaries; but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leak stemming from implicit data flow, but dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall 14% overhead) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead as the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, 'TaintCheck' is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the 'application binary', and TaintDroid then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of 'malicious applications'.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf 8].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004)

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:56:17Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the reader first needs to understand the ideas on which it is built. The following two concepts are central to understanding it: 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique that computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because the static analysis does not have access to any run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint values, where T is tainted and U is untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 

''Note: The paper talks about Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" to mark variables that are of value and track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data. (This is noted in case the terminology causes confusion.)''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if marked data is ever seen leaving the network interface of the phone, then we know that some sensitive information has been disseminated.
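
The source-to-sink idea can be sketched in a few lines of Python. Everything named here (<code>TaintedStr</code>, <code>concat</code>, <code>read_location</code>, <code>network_send</code> and the sample coordinates) is hypothetical and stands in for a real taint source and a real network sink; the sketch only shows the mark being attached, carried along and checked.

```python
class TaintedStr(str):
    """A string that carries the set of sensitive sources it was derived from."""

    def __new__(cls, value, labels=frozenset()):
        obj = super().__new__(cls, value)
        obj.labels = frozenset(labels)
        return obj


def concat(a, b):
    """Join two strings; the result carries the labels of both inputs."""
    labels = getattr(a, "labels", frozenset()) | getattr(b, "labels", frozenset())
    return TaintedStr(str(a) + str(b), labels)


def read_location():
    """Hypothetical taint source: data is marked where it enters the program."""
    return TaintedStr("45.38,-75.69", {"location"})


alerts = []


def network_send(payload):
    """Hypothetical taint sink: record an alert if marked data is leaving."""
    labels = getattr(payload, "labels", frozenset())
    if labels:
        alerts.append(sorted(labels))
    return len(payload)


# The location is combined with other text before it is sent,
# yet the mark is still present when the data reaches the sink.
network_send(concat("lat_lon=", read_location()))
network_send("hello")
```

Only the first call adds an entry to <code>alerts</code>; the plain string passes through the sink without being flagged.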

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).
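
The coarser message and file levels can be illustrated with a sketch that assumes a single tag is kept per message and per file, formed as the union of the tags of everything placed in it. That assumption is one reading of "conservatively retains its taint markings"; the names <code>message_tag</code> and <code>TaggedFile</code> are hypothetical.

```python
def message_tag(field_tags):
    """One tag for a whole message: the union of the tags of its fields."""
    tag = 0
    for field_tag in field_tags:
        tag |= field_tag
    return tag


class TaggedFile:
    """A file with a single tag covering everything ever written to it."""

    def __init__(self):
        self.chunks = []
        self.tag = 0

    def write(self, data, tag=0):
        # The file tag only ever grows, so markings persist across writes.
        self.chunks.append(data)
        self.tag |= tag

    def read_chunk(self, index):
        # Every read returns the file-wide tag, even for untainted chunks.
        return self.chunks[index], self.tag


f = TaggedFile()
f.write("settings", tag=0)
f.write("45.38,-75.69", tag=0b10)
```

Reading the untainted <code>"settings"</code> chunk still returns the file-wide tag. This shows the trade-off of coarse granularity: less bookkeeping, at the price of possible false positives.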

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11], rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which arise because most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely, as the overheads (in this case a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision an application to which you upload the application binary, and TaintDroid then returns a result indicating whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf 8].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004)

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:55:59Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form their basis have to be understood. The following two concepts are central to understanding this paper:
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information flow theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
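The difference between the two kinds of flow can be sketched with a toy tracker that only propagates taint on assignment. The `Tainted` class and helper functions below are hypothetical and exist only to illustrate why a branch on a secret escapes such a tracker.

```python
class Tainted:
    """Hypothetical wrapper: a value paired with a boolean taint flag."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted


def assign(src):
    # Explicit flow: copying a value copies its taint flag too.
    return Tainted(src.value, src.tainted)


def branch_on(secret):
    # Implicit flow: the branch depends on the secret, but the constants
    # assigned inside it carry no taint of their own.
    if secret.value == "blah blah":
        return Tainted(1)
    return Tainted(0)


secure = Tainted("blah blah", tainted=True)
explicit_copy = assign(secure)
implicit_copy = branch_on(secure)

print(explicit_copy.tainted)                       # taint followed the copy
print(implicit_copy.value, implicit_copy.tainted)  # value reveals the secret, flag does not
```

The result of `branch_on` tells an observer whether the secret matched, yet it is marked untainted, which is exactly the leak an assignment-only tracker misses.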

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique that computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the sources of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because a static analysis does not have access to any additional run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>For all variables V = {T,U} ;T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
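The ⊕ operator above can be written out directly. This is a minimal sketch using the strings `"T"` and `"U"` as labels; the function names `join` and `join_all` are assumptions made for illustration.

```python
from functools import reduce

T, U = "T", "U"  # tainted and untainted labels


def join(x, y):
    """The ⊕ operator: the result is tainted if either operand is tainted."""
    return T if x == T or y == T else U


def join_all(labels):
    """Label of a value computed from many operands; untainted if there are none."""
    return reduce(join, labels, U)


# Print the full truth table of ⊕.
for x in (T, U):
    for y in (T, U):
        print(x, "⊕", y, "=", join(x, y))
```

Folding `join` over all operands of an expression gives the label of its result, so a single tainted input is enough to taint the output.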

''Note: The paper talks about dynamic taint analysis. TaintDroid makes ingenious use of "taint" to mark variables that are of value and track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if we ever see marked data leave through the network interface of the phone, then we know that some sensitive information has been disseminated.
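The mark-at-source, check-at-sink idea can be sketched as follows. All names here (`Data`, `read_location`, `network_send`, the host name) are hypothetical stand-ins; no real sensor or network is involved.

```python
class Data:
    """Hypothetical value carrying a taint flag alongside its text."""

    def __init__(self, text, tainted=False):
        self.text = text
        self.tainted = tainted


def read_location():
    # Taint source: sensitive data is marked where it enters the system.
    return Data("45.38,-75.69", tainted=True)


def concat(a, b):
    # Propagation: the result is tainted if either input was.
    return Data(a.text + b.text, a.tainted or b.tainted)


leak_log = []


def network_send(data, host):
    # Taint sink: marked data leaving the device is logged (nothing is sent).
    if data.tainted:
        leak_log.append((host, data.text))


payload = concat(Data("loc="), read_location())
network_send(payload, "ads.example.com")       # logged: carries location taint
network_send(Data("ping"), "ads.example.com")  # not logged: untainted
print(leak_log)
```

Even though the location was combined with other text before being sent, the mark survives the transformation and is caught at the sink.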

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource-constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. Note that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they modified the Java Native Interface (JNI) to provide message-level tracking, which allows them to monitor inter-process (that is, inter-application) communication. This also allows them to "patch the taint propagation on return" (Enck et al., YEAR, p3), so they can keep track of information transferred via native code. Finally, by modifying the network and secondary storage interfaces, they provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11], rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which arise because most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely, as the overheads (in this case a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision an application to which you upload the application binary, and TaintDroid then returns a result indicating whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf 8].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. ''Twenty-First Annual Computer Security Applications Conference (ACSAC)'' (2005)

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004)

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] FITZPATRICK, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006)

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security'' (2007)

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:54:14Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form the basis of its approach have to be understood. The following two concepts are central to understanding it: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information flow theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against because of their indirect nature.
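The two kinds of flow can be contrasted in a few lines of Python. This is only an illustrative sketch: the function names <code>explicit_leak</code> and <code>implicit_leak</code> and the <code>guess</code> parameter are hypothetical and do not come from the paper.

```python
def explicit_leak(secure: str) -> str:
    """Explicit flow: the private value is copied straight into a public variable."""
    notsecure = secure
    return notsecure


def implicit_leak(secure: str, guess: str) -> int:
    """Implicit flow: nothing is copied from `secure`, but the branch that is
    taken reveals whether `secure` equals `guess`."""
    if secure == guess:
        notsecure = 1
    else:
        notsecure = 0
    return notsecure
```

An observer who only sees the public result of <code>implicit_leak</code> still learns one bit about the private value on every call, which is why a tracker that only follows assignments misses this kind of leak.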

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 
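As a rough illustration of the run-time approach described above, the following Python sketch labels untrusted input with a string subclass and logs an event when labelled data reaches a sensitive operation. The names <code>TaintedStr</code> and <code>run_command</code> are hypothetical, and real systems track taint inside the interpreter or at the instruction level rather than with a wrapper class.

```python
class TaintedStr(str):
    """A string that carries a 'tainted' label because it came from an untrusted source."""

    def __add__(self, other):
        # Concatenation propagates the label to the result.
        return TaintedStr(str.__add__(self, other))

    def __radd__(self, other):
        # Also propagate when the tainted value is the right-hand operand.
        return TaintedStr(str.__add__(other, self))


def run_command(cmd: str, log: list) -> bool:
    """Hypothetical sensitive operation: refuse and log if the command is tainted."""
    if isinstance(cmd, TaintedStr):
        log.append("tainted data reached run_command: " + str(cmd))
        return False
    return True
```

The extra <code>isinstance</code> check on every sensitive call is a small-scale version of the run-time overhead mentioned above.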

===Static Taint Analysis===
Static taint analysis is a technique for computing an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis because the static analysis does not have access to any additional run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint values, where T is tainted and U is untainted: 
Define <big>⊕</big>: V x V -> V, with x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
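The propagation rule above can be written as a small Python function. This is a minimal sketch of the model as stated here; the names <code>Taint</code> and <code>combine</code> are illustrative and are not taken from the cited work.

```python
from enum import Enum


class Taint(Enum):
    T = "tainted"
    U = "untainted"


def combine(x: Taint, y: Taint) -> Taint:
    """The operator from the model: the result is tainted if either operand is tainted."""
    return Taint.T if Taint.T in (x, y) else Taint.U
```

Applying <code>combine</code> to the tags of the operands of every operation is what makes the taint follow data from variable to variable.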

''Note: The paper discusses Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" to mark variables that are of value and track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data. (This is noted in case the difference causes confusion.) ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if we ever see marked data leave the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes only roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)
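One compact way to hold 32 independent taint markings per data unit is a 32-bit vector, where each bit stands for one kind of sensitive data and tags are merged with a bitwise OR. The Python sketch below only illustrates that idea; the constant names and bit assignments are hypothetical and are not taken from the TaintDroid source.

```python
# Hypothetical bit assignments: one bit per kind of sensitive data.
TAINT_CLEAR = 0
TAINT_LOCATION = 1 << 0
TAINT_PHONE_NUMBER = 1 << 1
TAINT_DEVICE_ID = 1 << 2

TAG_MASK = 0xFFFFFFFF  # keep tags within 32 bits


def merge_tags(a: int, b: int) -> int:
    """Union of two taint tags: the result carries every marking of both inputs."""
    return (a | b) & TAG_MASK


def has_marking(tag: int, marking: int) -> bool:
    """True if `tag` carries the given marking."""
    return (tag & marking) != 0
```

Because merging two tags is a single OR on a machine word, propagating all 32 markings costs about the same as propagating one, which fits the low overhead figures quoted above.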

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11], rely on instruction-level dynamic taint analysis using whole-system emulation. This method can lead to the system performing 2-20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11] Moreover, instruction-level tracking faces a serious problem: taint explosion. When complex instructions such as CMPXCHG or REP MOV are used, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem with the combination of four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction by providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partially because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead as the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All that TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, and TaintDroid then returns a result indicating whether the application is safe or not. This has the advantage of needing to run just once before installation, and hence the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf 11].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009)

[6] Fitzpatrick, M. [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. ''BBC News, Technology'' (March 2010)

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:53:51Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form the basis of its approach have to be understood. The following two concepts are central to understanding it: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information flow theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique for computing an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis because the static analysis does not have access to any additional run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint values, where T is tainted and U is untainted: 
Define <big>⊕</big>: V x V -> V, with x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 

''Note: The paper discusses Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" to mark variables that are of value and track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data. (This is noted in case the difference causes confusion.) ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if we ever see marked data leave the network interface of the phone, then we know that some sensitive information has been disseminated.
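The source-to-sink idea can be sketched in Python as follows. Every name here (<code>Tagged</code>, <code>read_location</code>, <code>concat</code>, <code>send_over_network</code>) and the sample coordinate string are hypothetical, chosen only to show a label being attached at a source, carried through a computation, and checked at the network boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """A value together with the set of sensitivity labels it carries."""
    value: str
    labels: frozenset = frozenset()


def read_location() -> Tagged:
    """Hypothetical taint source: data is labelled where it enters the system."""
    return Tagged("45.38,-75.70", frozenset({"location"}))


def concat(a: Tagged, b: Tagged) -> Tagged:
    """Combining values combines their labels, so the mark follows the data."""
    return Tagged(a.value + b.value, a.labels | b.labels)


def send_over_network(data: Tagged, alerts: list) -> None:
    """Hypothetical taint sink: record an alert if labelled data is about to leave."""
    if data.labels:
        alerts.append(sorted(data.labels))
```

In this sketch the location is combined with other text before being sent, and the label survives the combination, so the sink still reports that location data is leaving the device.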

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% figure was measured on a "CPU-bound micro-benchmark", and that TaintDroid "imposes negligible overhead on interactive third-party applications." (Enck et al., 2010, p1)

This low overhead is achieved by instrumenting the Dalvik virtual machine (VM) interpreter, the layer of the Android system that runs application code, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they provide message-level tracking, which allows them to monitor inter-process (that is, inter-application) communication. For native code reached through the Java Native Interface (JNI), they use method-level tracking, which allows them to "patch the taint propagation on return" (Enck et al., 2010, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings" (Enck et al., 2010, p3).
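One way to picture "32 taint markings per tainted data unit" is a 32-bit tag in which each bit stands for one marking. The Python sketch below assumes that encoding and assumes that a message's tag is the union of the tags of the values it contains; the marking names and helper functions are hypothetical and are not taken from the TaintDroid source.

```python
# Hypothetical marking names, one bit each; a 32-bit tag can hold up to 32 of them.
LOCATION = 1 << 0
PHONE_NUMBER = 1 << 1
DEVICE_ID = 1 << 2
TAG_MASK = 0xFFFFFFFF  # keep tags within 32 bits


def combine(*tags):
    """Union of taint tags: a bitwise OR, truncated to 32 bits."""
    result = 0
    for tag in tags:
        result |= tag
    return result & TAG_MASK


def message_tag(variable_tags):
    """Message-level tracking: one tag for the whole message, the union of its parts."""
    return combine(*variable_tags)


def describe(tag):
    """List the names of the markings set in a tag."""
    names = {LOCATION: "LOCATION", PHONE_NUMBER: "PHONE_NUMBER", DEVICE_ID: "DEVICE_ID"}
    return [name for bit, name in names.items() if tag & bit]


msg = message_tag([LOCATION, 0, DEVICE_ID])
print(describe(msg))  # ['LOCATION', 'DEVICE_ID']
```

Keeping one tag per message is coarser than keeping one per variable: it is cheaper to store and propagate, at the cost of marking untainted parts of the message as tainted.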

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], TaintDroid uses the virtualized architecture of Android to combine four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics chosen at each level strongly influence both the performance and the accuracy of TaintDroid. Existing taint tracking approaches, like the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11], rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were clearly legitimate uses.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications sent the user's location to advertising servers and that seven of the applications transmitted the device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer-grained control is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, and TaintDroid then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section where more information pertaining to each question can be found.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine used by Android to run user applications written in Java. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf 11].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005).

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009).

[6] Fitzpatrick, M. (March 2010). [http://news.bbc.co.uk/2/hi/technology/8559683.stm Mobile that allows bosses to snoop on staff developed]. BBC News, Technology.

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007).

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:51:25Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding this paper:
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' by observing which branch of the conditional was taken, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
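The difference between the two kinds of flow can be made concrete with a small Python sketch. It assumes a deliberately simple tracker that follows explicit copies only; the <code>Tracked</code> class is a hypothetical illustration, not part of any real taint tracking system.

```python
class Tracked:
    """A value with a taint flag that follows explicit copies only (hypothetical)."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def copy(self):
        # Explicit flow: the copy inherits the taint flag.
        return Tracked(self.value, self.tainted)


secure = Tracked("blah blah", tainted=True)

# Explicit flow: caught, because the taint flag travels with the copy.
explicit_copy = secure.copy()

# Implicit flow: the branch depends on the secret, but the value assigned
# is a fresh constant, so an explicit-only tracker leaves it untainted.
if secure.value == "blah blah":
    notsecure = Tracked(1)
else:
    notsecure = Tracked(0)

print(explicit_copy.tainted)               # True
print(notsecure.value, notsecure.tainted)  # 1 False
```

The final value of <code>notsecure</code> reveals whether the secret matched, yet it carries no taint, which is why a tracker that ignores control flow can be bypassed this way.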

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique that computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis because the static analysis does not have access to any additional run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Let V = {T, U}, where T means tainted and U means untainted. 
Define <big>⊕</big>: V × V → V such that, for all x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
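The propagation rule can be written out directly. The following Python sketch encodes the two-element set {T, U} as an enumeration and the ⊕ operator as a function; the names <code>Taint</code>, <code>combine</code> and <code>taint_of</code> are illustrative choices, not taken from the cited paper.

```python
from enum import Enum
from functools import reduce


class Taint(Enum):
    U = 0  # untainted
    T = 1  # tainted


def combine(x, y):
    """The ⊕ operator: the result is tainted if either operand is tainted."""
    return Taint.T if Taint.T in (x, y) else Taint.U


def taint_of(*operands):
    # An expression with several operands folds ⊕ over all of them.
    return reduce(combine, operands, Taint.U)


# Print the full truth table of ⊕.
for x in Taint:
    for y in Taint:
        print(x.name, "⊕", y.name, "=", combine(x, y).name)
```

Because ⊕ behaves like a logical OR, a single tainted operand anywhere in an expression is enough to taint the result, for example <code>taint_of(Taint.U, Taint.U, Taint.T)</code> evaluates to <code>Taint.T</code>.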

''Note: The paper is about Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" to mark variables that are of value and track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data. (In case this causes confusion.)''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the phone through the network interface, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to run in real time, the phone must remain "usable" to the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% figure was measured on a "CPU-bound micro-benchmark", and that TaintDroid "imposes negligible overhead on interactive third-party applications." (Enck et al., 2010, p1)

This low overhead is achieved by instrumenting the Dalvik virtual machine (VM) interpreter, the layer of the Android system that runs application code, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they provide message-level tracking, which allows them to monitor inter-process (that is, inter-application) communication. For native code reached through the Java Native Interface (JNI), they use method-level tracking, which allows them to "patch the taint propagation on return" (Enck et al., 2010, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings" (Enck et al., 2010, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], TaintDroid uses the virtualized architecture of Android to combine four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics chosen at each level strongly influence both the performance and the accuracy of TaintDroid. Existing taint tracking approaches, like the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11], rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were clearly legitimate uses.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications sent the user's location to advertising servers and that seven of the applications transmitted the device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer-grained control is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, and TaintDroid then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section where more information pertaining to each question can be found.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine used by Android to run user applications written in Java. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf 11].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005).

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009).

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] ''http://pskl.us''

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007).

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145, University of California, Berkeley'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

Talk:COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:49:35Z

Sliske: /* Work Plan */

Group Members

Trevor Bonesaw Malone - tmalone@connect.carleton.ca //FIRST POST!

Qi Zhang - qzhang13@connect.carleton.ca

Gregory Bint - gbint@connect.carleton.ca

Gautam Akiwate - gakiwate@connect.carleton.ca

Corey Ling - cling@connect.carleton.ca

Sarah Liske

==Relevant Sources==
*NEWSOME, J., AND SONG, D. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection]
Seems to be THE Dynamic Taint Analysis paper. Talks about the TaintCheck implementation. Could also be useful for the critique section -[Gautam]

== Work Plan ==

As Trevor intimated, we should have clear division of work going forward. This is sort of the break down as I see it. Please edit as you think of new ideas!

* Background Concepts
** Information Flow Theory. (Implicit and Explicit Flows.) --Done[--[[User:Gautam|Gautam]] 03:54, 28 November 2010 (UTC)]
** What is dynamic taint analysis --Done[--[[User:Gautam|Gautam]] 05:07, 28 November 2010 (UTC)]
** What is the difference between dynamic and static analysis - it's there, who added it?
* Research Problem
** How do we build a DTA engine for a phone? - done, but by who?
** Why do we want to? (information misuse) - done, but by who?
* Contribution
** How did they implement their DTA engine (Done: --[[User:Cling|Cling]] 04:50, 26 November 2010 (UTC))
** What did they find about information misuse (Done: --[[User:Cling|Cling]] 04:50, 26 November 2010 (UTC))
** Compared to the existing taint tracking approaches. [[User:Zhangqi|Zhangqi]] 07:11, 27 November 2010 (UTC) (Added something. Still looking for other examples,in progress)
** (What else should be in the contributions? Anything need fleshing out?) (Working on that now :) ) sliske
* Critique
**Added two paragraphs at the end of the present critique. Please incorporate it into your content as you deem fit.--[[User:Gautam|Gautam]] 09:07, 30 November 2010 (UTC)
**^ done. fleshed out critique, and added a bit about how taintdroid doesn't track implicit flow. Also reworded (the entire essay) for clarity where necessary/checked spelling. It would be a good idea for everyone to read it over once for spelling/clarity before thursday, just in case something doesn't make sense - sliske
* References
** The article has 61 references! We can probably use some of them
**whee! reading papers and sticking in information as need be. Also working out how to cite properly, as there are two citations used currently
references added and citations taken care of. will go over fill in a few places where information may be lacking after class sliske

List of information we need to find external sources for:
* History of taint analysis
* History of privacy research relating to smart phones

== Work In Progress ==

Log what you are working on *right now* so that other people don't try to do the same thing. Make sure to clear your name from here when you are done.

* Gregory Bint: Research Problem
** Need to find some history on smart phone security research for the second part.

* Gautam Akiwate: Background Concepts
** Any resources on Dynamic taint Analysis would be appreciated!

* Corey Ling: Contributions (Qi Zhang)

* Trevor Malone: Critique

* Sarah Liske: References and Questions, Clarity/Spelling.

== Some Notes from the Video ==

Tracking of privacy-sensitive data through Dynamic Taint Analysis (aka Taint Tracking). The trick is to mark private data as it is sourced, and then follow those marks until (unless) they leave the phone.

Android phones run Java apps, which are compiled into DEX, and then run on top of the Dalvik VM. It is this VM that we modify so that we can support the storage and tracking of taint tags.

Taint sources
* Low-bandwidth sensors
** Location
** Accelerometer
* High-bandwidth sensors
** Mic
** Camera
* Information DB
** Address book
** SMS storage
* Device ID
** IMEI
** IMSI (don't actually track this one because of false positives)
** ICC_ID
** Phone Number

Taint sink (where marked data can leave the phone)
* Network Taint Sink

Taint propagation
* ???

Taint tags are stored in memory interleaved with the variables they are tracking

Some standard Data Flow technique is used to propagate these tags, especially as one variable that is marked may be assigned to another, so now that variable needs to be tracked as well.

Tracks explicit flows of data, not implicit
To fully capture implicit flows, you need to do static analysis, which is hard with closed-source apps, and cannot be done real-time

Implicit flows are not tracked
* Implicit flows can involve "taint-scope", tracking based on conditionals in code

=== Performance ===

The goal is to create a real-time tracking system, so TaintDroid's performance impact is of some importance.

14% CPU overhead
4.4% memory overhead

Macro benchmarks (to get a feel for what the phone's usability is like with TD running)
* App load: 3% (2ms)

=== Findings ===

20 out of 30 tested applications share data in a way that is not expected.

67 of 105 flagged pieces of data leaving the device had no obviously legitimate purpose (verified by the authors).

Many apps sent location data and other unique identifiers to advertising servers.

Most apps do not mention anything to the user.

=== Limitations ===

Tracks only explicit data flows.

An application *could* launder the tags off of the data, if they really wanted to hide this sort of thing from TaintDroid.

There are methods that could be used to protect against this, but they go against the goal of a light-weight, real-time tracking system. TD is not necessarily about catching truly malicious programs, but rather just those that leak information.

Why do apps take this information?
* Lazy; in the demo video, the wallpaper app seems to use the IMEI just as a ready made unique ID
* Overzealous; the developer might think they *need* the data for something, but actually do not
* Ads; advertisers do seem a little presumptuous in their data collection
* Spying; bosses or spouses
* Malicious;

=== QA Period ===

Q: how do we prevent a malicious app from removing a taint attribute on a file

A: TD operates at too low a level for this to be a problem; TD assumes that the native code is trusted

Q: It seems like you had a lot of false positives

A: The point of this tool was to identify privacy sensitive information as having left the phone, not whether or not a privacy violation has taken place.

Q: Now that TD is released; couldn't malicious apps use some of the methods described in the paper to get around it?

A: Well, yes, but it is not just about maliciousness; it could just be laziness or over-zealous ad stuff.

==Other Information==

Hey guys, thought I would just post a generalized paragraph about our essay.

In today’s society, smartphones are the new big thing. To me that’s what makes this paper so interesting. This paper focuses on private information on Android phones and the misuse of this information. The information that can be misused includes the SIM card serial number, the device ID, and the phone number. TaintDroid provides smartphones with an efficient taint tracking and analysis system. It can track sensitive data from multiple sources and identify misuse of such data. In their study of 30 popular third-party applications, TaintDroid found 68 instances of potential misuse of users' private data across 20 applications. This tool is great for knowing which applications are safe and which are not, so your private data can remain private.

Also, we should really think of splitting up the work in some way. If some people have specific sections they would like to do, let's figure that out now so we can divide the workload and get it done over the next couple of days. I don't personally care what part I'm going to have to do, so let's get this going. Any other information people wanna post, feel free; the more the better, even if we don't end up using it.

[[user:Tmalone|Trevor Malone]]

Hey guys! Anything else we need to get done? Let me know and I can help in anyway possible.

[[user:Tmalone|Trevor Malone]]

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:46:15Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form the basis of its theory have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure = "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
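The difference between the two kinds of flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TaintDroid's implementation: the <code>Tainted</code> wrapper and the functions <code>explicit_leak</code> and <code>implicit_leak</code> are hypothetical names introduced here. The tracker propagates a taint flag on copies only, so the branch-based leak produces a value that depends on the secret but carries no taint.

```python
class Tainted:
    """A value paired with a boolean taint flag (hypothetical helper)."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def copy(self):
        # Explicit flow: the taint flag travels with the copied value.
        return Tainted(self.value, self.tainted)


def explicit_leak(secure):
    # Mirrors "notsecure = secure": a direct assignment.
    notsecure = secure.copy()
    return notsecure


def implicit_leak(secure):
    # Mirrors the if/else example: only untainted constants are assigned,
    # yet the result still reveals whether the secret matched.
    if secure.value == "blah blah":
        notsecure = Tainted(1)
    else:
        notsecure = Tainted(0)
    return notsecure


secret = Tainted("blah blah", tainted=True)
print(explicit_leak(secret).tainted)   # the copy is still marked
print(implicit_leak(secret).tainted)   # the mark is lost
```

A tracker that follows only explicit flows reports the first leak and misses the second, which is why implicit flows are described as harder to detect.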

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique for computing an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the sources of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because static analysis does not have access to run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Each variable has a taint state in V = {T, U}, where T is tainted and U is untainted. 
Define <big>⊕</big>: V × V → V for x, y ∈ V as: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
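As a minimal sketch of the ⊕ operator above, the two rules can be written directly in Python. The names <code>Taint</code>, <code>combine</code> and <code>combine_all</code> are assumptions made for this illustration and do not come from the paper.

```python
from enum import Enum
from functools import reduce


class Taint(Enum):
    T = "tainted"
    U = "untainted"


def combine(x: Taint, y: Taint) -> Taint:
    # x (+) y is T if either operand is T, and U only if both are U.
    return Taint.T if Taint.T in (x, y) else Taint.U


def combine_all(states):
    # Fold the operator over any number of operands, starting from U.
    return reduce(combine, states, Taint.U)


for x in Taint:
    for y in Taint:
        print(x.name, "(+)", y.name, "=", combine(x, y).name)
```

Because a single tainted operand is enough to taint the result, taint can only spread through a computation, never disappear; this is the propagation property described above.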

''Note: The paper is about Dynamic Taint Analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid instead makes ingenious use of taint to mark variables that hold valuable private data and tracks their progress, so here the "tainted" variables are in fact important private data. (In case this confused someone.)''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if we ever see marked data leave the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid causes only roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).
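One way to picture "32 taint markings per tainted data unit" is a 32-bit tag in which each bit stands for one kind of sensitive data. The following Python sketch rests on that assumption; the bit assignments, the names <code>merge_tags</code>, <code>describe</code> and <code>network_send</code>, and the logging format are all hypothetical and are not taken from TaintDroid's source.

```python
TAINT_CLEAR = 0
TAINT_LOCATION = 1 << 0   # hypothetical bit assignments
TAINT_IMEI = 1 << 1
TAINT_CONTACTS = 1 << 2
TAG_MASK = 0xFFFFFFFF     # keep tags within 32 bits

NAMES = {
    TAINT_LOCATION: "location",
    TAINT_IMEI: "imei",
    TAINT_CONTACTS: "contacts",
}


def merge_tags(*tags):
    # A value derived from several inputs carries the union of their markings.
    result = TAINT_CLEAR
    for tag in tags:
        result |= tag
    return result & TAG_MASK


def describe(tag):
    # List the names of the markings that are set in a tag.
    return sorted(name for bit, name in NAMES.items() if tag & bit)


def network_send(payload, tag, log):
    # Taint sink: record any payload that still carries a marking.
    if tag != TAINT_CLEAR:
        log.append((payload, describe(tag)))


log = []
message_tag = merge_tags(TAINT_LOCATION, TAINT_IMEI)
network_send("lat=45.4;id=1234", message_tag, log)
network_send("hello", TAINT_CLEAR, log)
print(log)
```

Storing all markings in one machine word keeps the per-variable cost small, which fits the low memory overhead reported above.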

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11], rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 37 were clearly legitimate uses.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications sent the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: most string objects have the same tag, so false positives can occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the 'Android System' itself incorporates these changes, which is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall 14% overhead) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead as the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, 'TaintCheck' performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the 'application binary' and which then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of 'malicious applications'.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine that Android uses to run user applications written in Java. Although Dalvik is a core part of the Android operating system, it is not part of the kernel; it runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system at which to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid cannot inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures are part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf 11].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. ''Communications of the ACM 19, 5'' (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005).

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/ Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. ''GINP ENSIMAG GoogleCode'' (2009).

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007).

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:44:55Z

Sliske: /* Questions */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form its basis have to be understood. The following two concepts are central to understanding it:
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the path the program takes, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
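
The difference between the two kinds of flow can be sketched in Python. This is only an illustration: the <code>Labeled</code> wrapper and both functions are hypothetical names invented here, and the tracker modelled is a simple one that follows direct assignments only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Labeled:
    """A value paired with a taint flag (hypothetical helper)."""
    value: object
    tainted: bool = False


def explicit_copy(secure: Labeled) -> Labeled:
    # Explicit flow: the data is assigned directly, so the label
    # travels with it and the leak is visible to the tracker.
    return Labeled(secure.value, secure.tainted)


def implicit_copy(secure: Labeled, guess: str) -> Labeled:
    # Implicit flow: the result is a fresh constant chosen by a branch
    # on the secret. No tainted value is ever assigned, so a tracker
    # that follows assignments only leaves the result untainted.
    if secure.value == guess:
        return Labeled(1)
    return Labeled(0)


secret = Labeled("blah blah", tainted=True)
leak_explicit = explicit_copy(secret)
leak_implicit = implicit_copy(secret, "blah blah")
```

Here <code>leak_implicit</code> reveals whether the guess was right, yet it carries no taint, which is why implicit flows are the harder case.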

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique for computing an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis because the static analysis does not have access to any additional run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Let V = {T,U} be the set of labels, where T is tainted and U is untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
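
The ⊕ operator above translates almost directly into code. The following is a minimal sketch; the names <code>Taint</code>, <code>join</code> and <code>label_of_result</code> are chosen here for illustration and do not come from the paper.

```python
from enum import Enum
from functools import reduce


class Taint(Enum):
    U = 0  # untainted
    T = 1  # tainted


def join(x: Taint, y: Taint) -> Taint:
    # x (+) y is T if either operand is T, and U only if both are U.
    return Taint.T if Taint.T in (x, y) else Taint.U


def label_of_result(*operands: Taint) -> Taint:
    # The label of a computed value is the join of all operand labels,
    # starting from U so that a constant with no operands is untainted.
    return reduce(join, operands, Taint.U)
```

For example, <code>label_of_result(Taint.U, Taint.T, Taint.U)</code> evaluates to <code>Taint.T</code>: a single tainted operand is enough to taint the result.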

''Note: The paper is about dynamic taint analysis. TaintDroid uses "taint" to mark variables that hold valuable data and tracks their progress. In taint analysis, "taint" normally marks untrusted information; in this case, however, the tainted variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the phone through the network interface, then we know that some sensitive information has been disseminated.
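
The source-to-sink idea can be illustrated with a toy tracker. This is a sketch under stated assumptions, not TaintDroid's implementation: <code>Tracked</code>, <code>source</code>, <code>concat</code> and <code>network_send</code> are hypothetical names, and the destination host is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A string together with the set of taint marks it carries."""
    value: str
    tags: frozenset = frozenset()


def source(value: str, tag: str) -> Tracked:
    # Taint source: sensitive data is marked where it enters the system.
    return Tracked(value, frozenset({tag}))


def concat(a: Tracked, b: Tracked) -> Tracked:
    # Propagation: the result carries the marks of both inputs.
    return Tracked(a.value + b.value, a.tags | b.tags)


alerts = []


def network_send(data: Tracked, destination: str) -> None:
    # Taint sink: marked data leaving over the network is logged.
    if data.tags:
        alerts.append((destination, sorted(data.tags)))


location = source("45.38,-75.69", "LOCATION")
prefix = Tracked("q=")
network_send(concat(prefix, location), "ads.example.com")  # logged
network_send(prefix, "ads.example.com")                    # not logged
```

Only the first call is recorded in <code>alerts</code>, because the mark survived the concatenation while the plain prefix never carried one.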

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, constrained hardware with minimal overhead. TaintDroid causes only roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)
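
One way to picture 32 taint markings per data unit is a 32-bit vector with one bit per marking, so that merging two tags is a single bitwise OR. The sketch below assumes that encoding; the specific bit positions and names are invented for illustration.

```python
# Hypothetical bit assignments; only the 32-bit width comes from the text.
LOCATION = 1 << 0
PHONE_NUMBER = 1 << 1
DEVICE_ID = 1 << 2
TAG_MASK = 0xFFFFFFFF  # keep tags within 32 bits


def merge(*tags: int) -> int:
    # The merged tag has a bit set if any input tag has it set.
    merged = 0
    for tag in tags:
        merged |= tag
    return merged & TAG_MASK


def has_marking(tag: int, marking: int) -> bool:
    # True if the given marking is present in the tag.
    return bool(tag & marking)


combined = merge(LOCATION, DEVICE_ID)
```

A fixed-width integer keeps the per-variable cost constant no matter how many of the 32 markings are set, which fits the low memory overhead described above.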

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).
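
The coarser message and file levels can be sketched by assuming that a whole message or file carries one tag, the union of the tags of everything placed in it. That assumption, the integer tag encoding, and the names <code>message_tag</code> and <code>TaggedFile</code> are illustrative only.

```python
def message_tag(field_tags):
    # One tag for the whole message: the OR of every field's tag.
    tag = 0
    for field_tag in field_tags:
        tag |= field_tag
    return tag


class TaggedFile:
    """An in-memory stand-in for a file that keeps a single taint tag."""

    def __init__(self):
        self.chunks = []
        self.tag = 0

    def write(self, data: str, tag: int) -> None:
        # Writing tainted data taints the whole file; tags never shrink.
        self.chunks.append(data)
        self.tag |= tag

    def read(self):
        # Everything read back inherits the file's tag, even parts that
        # were untainted when written (conservative, may over-taint).
        return "".join(self.chunks), self.tag


f = TaggedFile()
f.write("hello ", 0)        # untainted data
f.write("45.38,-75.69", 1)  # data carrying marking bit 0
contents, tag = f.read()
```

Tracking one tag per message or file is cheap, but it is also a source of false positives, since untainted data stored alongside tainted data comes back tainted.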

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11], rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2-20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11] Moreover, instruction-level tracking faces a serious problem: taint explosion. When complex instructions such as CMPXCHG or REP MOV are used, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid, however, addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, distinguishing between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications submitted the user's location to advertising servers and that 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This partially is because the applications being tested are loaded onto the phone as black-box, precompiled binaries; but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leak stemming from implicit data flow, but dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All that TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision an application to which you upload the application binary, after which TaintDroid returns a verdict on whether the application is safe or not. This has the advantage of needing to run just once before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf 11].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005).

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. Understanding Data Lifetime via Whole System Simulation. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et.al [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis] GINP ENSIMAG GoogleCode(2009)

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007).

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:44:31Z

Sliske: /* Contribution */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form its basis have to be understood. The following two concepts are central to understanding it:
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the path the program takes, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique for computing an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis because the static analysis does not have access to any additional run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Let V = {T,U} be the set of labels, where T is tainted and U is untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 

''Note: The paper is about dynamic taint analysis. TaintDroid uses "taint" to mark variables that hold valuable data and tracks their progress. In taint analysis, "taint" normally marks untrusted information; in this case, however, the tainted variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the phone through the network interface, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, constrained hardware with minimal overhead. TaintDroid causes only roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11], rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, variable-level tracking lets TaintDroid provide flow semantics for taint propagation, distinguishing between data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 37 were clearly legitimate uses of the data.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications sent the user's location to advertising servers and that 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction by providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the bytecode run by Android's Dalvik VM does not maintain branch structures, which TaintDroid would need in order to track implicit flow dynamically. A static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: to save memory, an array (and therefore a string) carries a single tag for all of its elements. Because the elements share one tag, it is possible for false positives to occur.
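
The false-positive case can be illustrated with a minimal sketch. It assumes the storage scheme keeps one tag per array or string rather than one per element; the class <code>TaggedArray</code> is a hypothetical stand-in, not TaintDroid's actual data structure.

```python
class TaggedArray:
    """An array that stores one taint tag for all of its elements."""

    def __init__(self, size):
        self.items = [None] * size
        self.tag = 0

    def store(self, index, item, item_tag):
        self.items[index] = item
        self.tag |= item_tag  # the element's tag is merged into the array's

    def load(self, index):
        # Every element is reported with the array's tag.
        return self.items[index], self.tag


arr = TaggedArray(2)
arr.store(0, "public banner text", 0)  # untainted element
arr.store(1, "secret", 1)              # tainted element
value, tag_of_public = arr.load(0)
```

Here <code>tag_of_public</code> is non-zero even though element 0 never held sensitive data, so sending it would be flagged as a leak.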

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall CPU overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Compare this with the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, 'TaintCheck' performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, and which then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section in which to find more information on each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine used by Android to run user applications written in Java. Although Dalvik is a core part of the Android operating system, it is not part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the Dalvik VM layer, where TaintDroid is implemented. Instead, control structures would have to be recovered from the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005).

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. Understanding Data Lifetime via Whole System Simulation. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG GoogleCode (2009).

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:44:30Z

Sliske: /* Critique */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form its basis have to be understood. The following two concepts are central to understanding it.
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to capture this flow of information in a mathematical model. 
In a security model, information flow can be categorized as explicit or implicit: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure = "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Since the value of 'notsecure' reveals whether 'secure' equals the tested string, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
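
The difference between the two kinds of flow can be sketched in Python. This is an illustration only: the <code>Tainted</code> wrapper and both copy functions are hypothetical, and the tracker modelled here follows data assignments but not branches.

```python
class Tainted:
    """A value paired with a taint flag (hypothetical helper)."""

    def __init__(self, value, tainted=True):
        self.value = value
        self.tainted = tainted


def explicit_copy(secure):
    # Explicit flow: the value is copied directly, so the flag travels with it.
    return Tainted(secure.value, secure.tainted)


def implicit_copy(secure, candidates):
    # Implicit flow: only the branch outcome depends on the secret. The value
    # returned comes from the untainted candidate list, so a tracker that
    # follows data assignments alone sees no taint.
    for guess in candidates:
        if secure.value == guess:
            return Tainted(guess, tainted=False)
    return Tainted(None, tainted=False)


secret = Tainted("1234")
leak_a = explicit_copy(secret)
leak_b = implicit_copy(secret, [f"{n:04d}" for n in range(10000)])
```

Both <code>leak_a</code> and <code>leak_b</code> end up holding the secret, but only <code>leak_a</code> is still marked as tainted.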

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because a static analysis does not have access to any run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>For all variables V = {T, U}, where T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, with x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
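
The ⊕ operator can be written out directly. The following is a minimal sketch of the model above; the function names <code>combine</code> and <code>label_of_result</code> are hypothetical and chosen for illustration.

```python
from functools import reduce

T, U = "T", "U"  # tainted / untainted labels, as in the model above


def combine(x, y):
    """x ⊕ y: tainted if either operand is tainted."""
    return T if T in (x, y) else U


def label_of_result(*operand_labels):
    """Label of a value computed from any number of operands."""
    return reduce(combine, operand_labels, U)
```

Because U is the identity element of ⊕, a value computed from no tainted operands stays untainted, while a single tainted operand is enough to taint the result.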

''Note: The paper talks about Dynamic Taint Analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid instead makes ingenious use of "taint" to mark variables that hold valuable private data, and tracks their progress through the system.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the network interface of the phone, then we know that some sensitive information has been disseminated.
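
The source-to-sink idea can be sketched as follows. This is a simplified illustration, not TaintDroid's implementation: the bit assignments, the <code>Value</code> class and the functions <code>read_gps</code> and <code>network_send</code> are all hypothetical, and the coordinates are made-up sample data.

```python
TAINT_CLEAR = 0x0
TAINT_LOCATION = 0x1      # hypothetical bit assignments, one bit per marking
TAINT_PHONE_NUMBER = 0x2


class Value:
    def __init__(self, data, tag=TAINT_CLEAR):
        self.data = data
        self.tag = tag

    def __add__(self, other):
        # A derived value carries the union of its operands' tags.
        return Value(self.data + other.data, self.tag | other.tag)


def read_gps():
    # Taint source: sensitive data is tagged where it enters the system.
    return Value("45.38,-75.69", TAINT_LOCATION)


def network_send(value, log):
    # Taint sink: record an event if tagged data leaves the device.
    if value.tag != TAINT_CLEAR:
        log.append(value.tag)
    return len(value.data)


log = []
payload = Value("loc=") + read_gps()
network_send(Value("hello"), log)  # untainted, nothing is logged
network_send(payload, log)         # tainted, the location marking is logged
```

Storing the markings as bits of an integer means that combining two values is a single bitwise OR, which keeps propagation cheap.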

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead was measured on a CPU-bound micro-benchmark; the paper reports that TaintDroid "imposes negligible overhead on interactive third-party applications." (Enck et al., 2010, p1)

This low overhead is achieved by modifying the interpreter of the Dalvik virtual machine, the VM layer of the Android system that runs application code, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they added message-level tracking, which allows them to monitor inter-process (that is, inter-application) communication. For native code reached through the Java Native Interface (JNI), they use method-level tracking: they "patch the taint propagation on return" (Enck et al., 2010, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces, they provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., 2010, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11], rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 11] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, variable-level tracking lets TaintDroid provide flow semantics for taint propagation, distinguishing between data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 37 were clearly legitimate uses of the data.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications sent the user's location to advertising servers and that 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction by providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the bytecode run by Android's Dalvik VM does not maintain branch structures, which TaintDroid would need in order to track implicit flow dynamically. A static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: to save memory, an array (and therefore a string) carries a single tag for all of its elements. Because the elements share one tag, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall CPU overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Compare this with the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, 'TaintCheck' performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, and which then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section in which to find more information on each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine used by Android to run user applications written in Java. Although Dalvik is a core part of the Android operating system, it is not part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the Dalvik VM layer, where TaintDroid is implemented. Instead, control structures would have to be recovered from the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005).

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. Understanding Data Lifetime via Whole System Simulation. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG GoogleCode (2009).

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:43:36Z

Sliske: /* Contribution */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding the paper: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can occur between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information flow theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value stored in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
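
The difference between the two kinds of flow can be sketched in Python. The <code>Tainted</code> wrapper and the function names below are hypothetical and exist only for illustration; they assume a tracker that carries a label along with direct assignments but ignores control flow. Under that assumption, the explicit copy keeps its label while the value reconstructed through a branch does not.

```python
class Tainted:
    """Hypothetical wrapper: a value plus a flag marking it as private."""
    def __init__(self, value, tainted=True):
        self.value = value
        self.tainted = tainted

def explicit_copy(secure):
    # Explicit flow: the private value is assigned directly,
    # so the label can be carried along with it.
    return Tainted(secure.value, secure.tainted)

def implicit_copy(secure):
    # Implicit flow: only the branch taken depends on the secret.
    # The constants assigned here carry no label of their own.
    if secure.value == "blah blah":
        notsecure = Tainted(1, tainted=False)
    else:
        notsecure = Tainted(0, tainted=False)
    return notsecure

secret = Tainted("blah blah")
leak_a = explicit_copy(secret)   # label survives the copy
leak_b = implicit_copy(secret)   # one bit of the secret, but no label
print(leak_a.tainted)
print(leak_b.value, leak_b.tainted)
```

The second result reveals whether the secret equals "blah blah" even though nothing marks it as sensitive, which is why implicit flows are hard to catch with label propagation alone.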

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called dynamic taint analysis. The approach is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because a static analysis does not have access to run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>For all variables V = {T,U} ;T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
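
As a minimal sketch of the model above, the ⊕ operator and its use for propagation can be written in Python. The names <code>join</code> and <code>propagate</code>, and the example variables, are hypothetical; the sketch assumes each variable has exactly one label, T or U.

```python
T, U = "T", "U"  # tainted / untainted labels from the model above

def join(x, y):
    """x ⊕ y: the result is tainted if either operand is tainted."""
    return T if T in (x, y) else U

def propagate(labels, dest, *sources):
    # dest is computed from sources, so its label is the join
    # of all source labels (unknown variables count as untainted).
    result = U
    for name in sources:
        result = join(result, labels.get(name, U))
    labels[dest] = result
    return result

labels = {"location": T, "counter": U}
propagate(labels, "msg", "location", "counter")  # msg becomes tainted
propagate(labels, "total", "counter")            # total stays untainted
print(labels)
```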

''Note: The paper talks about dynamic taint analysis. TaintDroid makes ingenious use of "taint" to mark variables that are of value and track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if marked data is ever seen leaving the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)
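
One way to picture "32 taint markings per tainted data unit" is a 32-bit vector with one bit per kind of sensitive data. The sketch below is only an illustration under that assumption; the marking names and helper functions are hypothetical and are not taken from the TaintDroid source.

```python
# Hypothetical marking names; one bit per kind of sensitive data.
TAINT_CLEAR = 0
TAINT_LOCATION = 1 << 0
TAINT_CONTACTS = 1 << 1
TAINT_DEVICE_ID = 1 << 2

MASK_32 = 0xFFFFFFFF  # room for 32 independent markings

def combine(tag_a, tag_b):
    # Merging two tags is a bitwise OR, a single cheap operation.
    return (tag_a | tag_b) & MASK_32

def has_marking(tag, marking):
    return (tag & marking) != 0

tag = combine(TAINT_LOCATION, TAINT_DEVICE_ID)
print(has_marking(tag, TAINT_LOCATION))   # location bit is set
print(has_marking(tag, TAINT_CONTACTS))   # contacts bit is not set
```

Storing the tag as one machine word keeps both the memory cost and the cost of combining tags small, which fits the low overheads reported above.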

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed, as opposed to relying on heuristics or manual labels. [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [http://www.usenix.org/events/sec04/tech/chow/chow_html/ 4], TaintDroid uses the virtualized architecture of Android to perform four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics of the tracking strongly influence both its performance and its accuracy. Existing taint tracking approaches, like the Panorama taint system [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 12], rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system run 2 to 20 times slower than normal, which is not suitable for realtime analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html 4][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 12] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely: the overheads, in this case a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%, are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, 'TaintCheck' performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, and TaintDroid then reports whether the application is safe or not. This has the advantage of needing to run just once before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section that contains more information on each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. ''Proceedings of the Network and Distributed System Security Symposium'' (NDSS 2005).

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. Understanding Data Lifetime via Whole System Simulation. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009).

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. ''International Workshop on Run Time Enforcement for Mobile and Distributed Systems'' (REM 2007).

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[11]

[12] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:38:15Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding the paper: 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value stored in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called dynamic taint analysis. The approach is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 
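
The label, track and log steps described above can be sketched as a toy tracker in Python. The class <code>TaintTracker</code> and the variable names are hypothetical; the sketch tracks variable names only and assumes every assignment and every dangerous use is reported to it explicitly, which a real run-time analysis would do through instrumentation.

```python
class TaintTracker:
    """Toy dynamic taint tracker over variable names (hypothetical)."""
    def __init__(self):
        self.tainted = set()   # names of variables currently tainted
        self.log = []          # (sink, tainted operands) records

    def source(self, name):
        # Data read from an untrusted source is labelled tainted.
        self.tainted.add(name)

    def assign(self, dest, *operands):
        # dest is computed from operands: taint propagates if any is tainted,
        # otherwise an old label on dest is cleared.
        if any(op in self.tainted for op in operands):
            self.tainted.add(dest)
        else:
            self.tainted.discard(dest)

    def sink(self, sink_name, *operands):
        # A potentially dangerous use: log it if tainted data is involved.
        hits = [op for op in operands if op in self.tainted]
        if hits:
            self.log.append((sink_name, hits))
        return bool(hits)

tracker = TaintTracker()
tracker.source("user_input")
tracker.assign("query", "prefix", "user_input")  # query becomes tainted
tracker.assign("banner", "prefix")               # banner stays clean
tracker.sink("exec", "query")                    # logged
tracker.sink("exec", "banner")                   # not logged
print(tracker.log)
```

Every assignment and every sink use goes through an extra check, which is the source of the slowdown mentioned above.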

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because a static analysis does not have access to run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>For all variables V = {T,U} ;T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 

''Note: The paper talks about dynamic taint analysis. TaintDroid makes ingenious use of "taint" to mark variables that are of value and track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if marked data is ever seen leaving the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).[REF]
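
To illustrate what tracking at message granularity could look like, the following Python sketch assumes that a message carries a single tag, the union of the tags of the values packed into it, and that the receiver applies that tag to everything it unpacks. The function names and the <code>LOCATION</code> marking are hypothetical, and the sketch is not the TaintDroid implementation.

```python
LOCATION = 0b01  # hypothetical marking for location data

def pack_message(values):
    """values: list of (payload, tag) pairs. Returns (payloads, message_tag)."""
    message_tag = 0
    payloads = []
    for payload, tag in values:
        payloads.append(payload)
        message_tag |= tag  # the message tag is the union of all value tags
    return payloads, message_tag

def unpack_message(message):
    # The receiver cannot tell which payload was sensitive,
    # so every unpacked value inherits the whole message tag.
    payloads, message_tag = message
    return [(payload, message_tag) for payload in payloads]

msg = pack_message([("45.38,-75.69", LOCATION), ("hello", 0)])
received = unpack_message(msg)
print(received)
```

Under this assumption the untainted string "hello" arrives tainted, which shows the trade-off of coarse granularity: one tag per message is cheap to store and send, but it can over-taint and lead to false positives.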

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as Panorama, rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system run 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, taint may be lost or the stack pointer may become falsely tainted. [REF] TaintDroid avoids this problem by combining the four levels of tracking. For example, the variable level gives TaintDroid the flow semantics it needs for taint propagation, letting it tell different data and pointers apart to keep tracking accurate.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 37 were clearly legitimate uses.[REF] It also determined that 50% of the applications sent the user's location to advertising servers, and seven of the applications transmitted the user's device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To avoid additional overhead, TaintDroid does not track implicit data flow, that is, flow through control structures. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage, which arise because an array, and therefore a string, carries a single taint tag for all of its elements. Because of the shared tag, it is possible for false positives to occur.
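The bypass described above can be illustrated with a toy tracker that, like TaintDroid, propagates taint only through explicit data flow. This is a sketch under that assumption. The <code>Tagged</code> class and both copy functions are hypothetical helpers, not part of TaintDroid.

```python
class Tagged:
    """A value paired with a boolean taint flag (hypothetical helper)."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def __add__(self, other):
        # Explicit flow: the result is tainted if either operand is.
        return Tagged(self.value + other.value, self.tainted or other.tainted)


def copy_explicit(secret):
    """Copies the secret through an arithmetic operation, so taint follows."""
    return secret + Tagged(0)


def copy_implicit(secret, max_value=255):
    """Copies the secret using only branch conditions, so taint is lost."""
    leaked = Tagged(0)
    for candidate in range(max_value + 1):
        if secret.value == candidate:
            leaked = Tagged(candidate)  # built from an untainted constant
    return leaked


secret = Tagged(42, tainted=True)
explicit = copy_explicit(secret)
implicit = copy_implicit(secret)
```

Both results hold the value 42, but only <code>explicit</code> is marked as tainted. A tracker that wanted to catch the second case would have to taint everything assigned inside a branch whose condition reads tainted data, which is the extra work TaintDroid chooses not to do.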

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely because the overheads (4.4% for memory, 27% for IPC and 14% for CPU) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider instead the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, 'TaintCheck' is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the 'application binary' and which returns a verdict on whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of 'malicious applications'.

=Questions=
Possible exam questions and brief answers are listed below, along with the section where more information on each question can be found.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et.al [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis] GINP ENSIMAG GoogleCode(2009)

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:37:33Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding this paper:
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
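To show why a single public bit is enough to leak a whole secret, here is a small sketch that repeats the branch above for every possible guess. The four-digit PIN, the function names and the candidate list are assumptions chosen for illustration only.

```python
def check(secure: str, guess: str) -> int:
    """The public result depends on the secret only through the branch taken."""
    if secure == guess:
        notsecure = 1
    else:
        notsecure = 0
    return notsecure


def recover(secure, candidates):
    """Recover the secret by observing only the public result of check()."""
    for guess in candidates:
        if check(secure, guess) == 1:
            return guess
    return None


secret_pin = "0427"  # hypothetical private value
candidates = [f"{n:04d}" for n in range(10000)]
recovered = recover(secret_pin, candidates)
```

At no point is the secret assigned to a public variable, yet <code>recovered</code> ends up equal to it. This is the kind of leak that tracking explicit flow alone does not see.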

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because a static analysis does not have access to the run-time information of the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 
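The over-approximation can be sketched on a toy program. The version below is a deliberately simple, flow-insensitive analysis: it ignores statement order and branch conditions and treats every assignment from every path as possible. The program representation (a list of target and operand names) and the variable names are assumptions made for this example, not the method of the cited work.

```python
def static_taint(assignments, sources):
    """Return every variable that may be tainted on some execution path.

    assignments: list of (target, [operands]) collected from all branches.
    sources: names that are tainted to begin with (for example, user input).
    """
    tainted = set(sources)
    changed = True
    while changed:  # iterate until no new variable becomes tainted
        changed = False
        for target, operands in assignments:
            if target not in tainted and any(op in tainted for op in operands):
                tainted.add(target)
                changed = True
    return tainted


program = [
    ("a", ["user_input"]),
    ("b", ["a"]),          # taken on one branch
    ("b", ["const"]),      # taken on the other branch
    ("c", ["const"]),
    ("d", ["b", "c"]),
]
result = static_taint(program, {"user_input"})
```

Here <code>b</code> and <code>d</code> are reported as tainted even though one branch assigns <code>b</code> a clean constant. That is the loss of accuracy mentioned above: the analysis covers all paths but cannot tell which one actually runs.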

===Mathematical Model===
<big><code>Let V = {T, U}, where T means tainted and U means untainted. 
Define <big>⊕</big>: V x V -> V. For all x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
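The operator above is small enough to write out and check exhaustively. The following sketch encodes T and U as strings (an arbitrary choice for this example) and builds the full table for ⊕.

```python
from functools import reduce
from itertools import product

T, U = "T", "U"  # tainted, untainted


def taint_join(x: str, y: str) -> str:
    """x ⊕ y: tainted if either operand is tainted, otherwise untainted."""
    return T if x == T or y == T else U


# Full table of the operator over V x V.
table = {(x, y): taint_join(x, y) for x, y in product((T, U), repeat=2)}


def taint_of(*operands: str) -> str:
    """Taint of a value computed from any number of operands."""
    return reduce(taint_join, operands, U)
```

Since U is the identity of ⊕, a value computed from no tainted inputs stays untainted, and a single tainted operand is enough to taint the result.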

''Note: The paper talks about Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" to mark variables that are of value and track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data. (Just in case it confused someone.) ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data ever leaves through the network interface of the phone, then we know that some sensitive information has been disseminated.
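The source-to-sink idea can be sketched in a few lines. This is only an illustration of the principle, not TaintDroid's implementation: the <code>Tracker</code> class, its method names, the label strings and the host names are all hypothetical.

```python
class Tracker:
    """Toy taint tracker: values travel as (value, set of labels) pairs."""

    def __init__(self):
        self.alerts = []

    def source(self, value, label):
        """Mark data at its source, e.g. a location reading."""
        return (value, frozenset([label]))

    def plain(self, value):
        """Wrap data that carries no taint."""
        return (value, frozenset())

    def concat(self, a, b):
        """Combine two values; the result keeps the labels of both."""
        return (a[0] + b[0], a[1] | b[1])

    def network_send(self, data, destination):
        """Sink: log an alert if labelled data is about to leave the phone."""
        value, labels = data
        if labels:
            self.alerts.append((destination, sorted(labels)))
        return value


tracker = Tracker()
location = tracker.source("45.38,-75.69", "location")
request = tracker.concat(tracker.plain("ad?pos="), location)
tracker.network_send(request, "ads.example.com")
tracker.network_send(tracker.plain("ping"), "time.example.com")
```

Only the first send produces an alert, because only that message was built from data marked at a source.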

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. Note that the 14% CPU overhead was measured on a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., 2010, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system (the Dalvik VM) to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the inter-process communication (IPC) layer to provide message-level tracking, which allows them to monitor inter-process, a.k.a. inter-application, communications. For native code reached through the Java Native Interface (JNI), they use method-level tracking: native code runs uninstrumented and they "patch the taint propagation on return" (Enck et al., 2010, p3), so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings" (Enck et al., 2010, p3).[REF]
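The coarser message and file levels can be pictured with the sketch below. It assumes that a message carries one tag that is the union of the tags of everything packed into it, and that a file carries one tag that only ever grows, which is one way to read "conservatively retains its taint markings". The function, the class and the constants are hypothetical.

```python
LOCATION = 0x1  # hypothetical marking bits
IMEI = 0x2


def message_tag(variable_tags):
    """One tag per message: the union of the tags of all variables in it."""
    tag = 0
    for t in variable_tags:
        tag |= t
    return tag


class TaggedFileStore:
    """File-level tracking: one tag per file, kept alongside the contents."""

    def __init__(self):
        self._data = {}
        self._tags = {}

    def write(self, path, data, tag):
        self._data[path] = self._data.get(path, b"") + data
        # Conservative: writing clean data never clears an existing marking.
        self._tags[path] = self._tags.get(path, 0) | tag

    def read(self, path):
        return self._data[path], self._tags[path]


msg = message_tag([0, LOCATION, IMEI])

store = TaggedFileStore()
store.write("cache.bin", b"45.38,-75.69", LOCATION)
store.write("cache.bin", b";hello", 0)
contents, file_tag = store.read("cache.bin")
```

The price of keeping one tag per message or file is precision: the untainted bytes appended to <code>cache.bin</code> come back marked as location data when the file is read.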

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as Panorama, rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system run 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, taint may be lost or the stack pointer may become falsely tainted. [REF] TaintDroid avoids this problem by combining the four levels of tracking. For example, the variable level gives TaintDroid the flow semantics it needs for taint propagation, letting it tell different data and pointers apart to keep tracking accurate.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 37 were clearly legitimate uses.[REF] It also determined that 50% of the applications sent the user's location to advertising servers, and seven of the applications transmitted the user's device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To avoid additional overhead, TaintDroid does not track implicit data flow, that is, flow through control structures. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage, which arise because an array, and therefore a string, carries a single taint tag for all of its elements. Because of the shared tag, it is possible for false positives to occur.
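The false positives from shared tags can be reproduced with a toy array that keeps one taint flag for all of its elements. This sketch assumes that is roughly how tag storage for arrays and strings behaves. The <code>TaggedArray</code> class is a hypothetical stand-in, not TaintDroid code.

```python
class TaggedArray:
    """Array with a single taint flag shared by every element."""

    def __init__(self, items):
        self.items = list(items)
        self.tainted = False

    def store(self, index, value, tainted):
        self.items[index] = value
        # One tainted element marks the whole array.
        self.tainted = self.tainted or tainted

    def load(self, index):
        # Every element is reported with the array-wide flag.
        return self.items[index], self.tainted


arr = TaggedArray(["hello", "world"])
arr.store(0, "356938035643809", True)   # hypothetical device ID, tainted
value, flagged = arr.load(1)            # "world" was never sensitive
```

Reading the untouched element returns it flagged as tainted. Storing one flag per array saves memory, but sending <code>"world"</code> over the network would now raise a false alarm.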

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely because the overheads (4.4% for memory, 27% for IPC and 14% for CPU) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider instead the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, 'TaintCheck' is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the 'application binary' and which returns a verdict on whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of 'malicious applications'.

=Questions=
Possible exam questions and brief answers are listed below, along with the section where more information on each question can be found.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html Understanding Data Lifetime via Whole System Simulation]. ''Proceedings of the 13th USENIX Security Symposium'' (August 2004).

[5] CEARA, D., POTET, ML., et.al [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis] GINP ENSIMAG GoogleCode(2009)

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:35:00Z

Sliske: /* Research problem */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form its basis have to be understood. The following two concepts are central to understanding it. 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique that computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis because the static analysis does not have access to any additional run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Each variable carries a taint value from V = {T, U}, where T is tainted and U is untainted. 
Using <big>⊕</big>: V x V -> V, with x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
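The propagation rule above can be sketched in a few lines of Python. This is only an illustration of the ⊕ operator, not code from the paper; the class name <code>Tracked</code> and the choice of addition as the operation are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A value paired with a taint flag (True plays the role of T, False of U)."""
    value: int
    tainted: bool = False

    def __add__(self, other: "Tracked") -> "Tracked":
        # x ⊕ y = T if x = T or y = T; the result is U only when both are U.
        return Tracked(self.value + other.value, self.tainted or other.tainted)


secret = Tracked(42, tainted=True)   # hypothetical private value
public = Tracked(8)                  # untainted value

total = secret + public   # uses a tainted operand, so the result is tainted
clean = public + public   # both operands untainted, so the result stays untainted
print(total.tainted, clean.tainted)
```

Any other binary operation would propagate the flag in the same way; only the computation on <code>value</code> changes.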

''Note: The paper discusses dynamic taint analysis. TaintDroid makes ingenious use of "taint" by tainting variables that are of value and tracking their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data. (In case this causes confusion.) ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is seen leaving through the network interface of the phone, then we know that some sensitive information has been disseminated.
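The source-to-sink idea can be illustrated with a minimal Python sketch. It does not reflect TaintDroid's actual implementation; <code>read_gps</code>, <code>network_send</code>, <code>Labeled</code> and the label name are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Labeled:
    """A piece of data together with the set of taint labels it carries."""
    data: str
    labels: frozenset = frozenset()

    def concat(self, other: "Labeled") -> "Labeled":
        # Combining two pieces of data yields the union of their labels.
        return Labeled(self.data + other.data, self.labels | other.labels)


def read_gps() -> Labeled:
    # Hypothetical taint source: data is marked at the point where it is read.
    return Labeled("45.38,-75.69", frozenset({"location"}))


leak_log = []


def network_send(message: Labeled) -> None:
    # Hypothetical taint sink: any label still attached here is reported.
    if message.labels:
        leak_log.append(sorted(message.labels))


network_send(Labeled("hello"))                    # no labels, nothing logged
network_send(Labeled("pos=").concat(read_gps()))  # label survives the concat
print(leak_log)
```

The mark is attached once at the source and checked once at the sink; everything in between only has to preserve it.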

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf 10]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead is only in regard to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)
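One compact way to hold 32 independent taint markings per data unit is a 32-bit bit vector, where each bit stands for one kind of sensitive data. The sketch below assumes that representation; the constant names and bit assignments are invented for illustration and are not taken from the TaintDroid source.

```python
# Hypothetical bit assignments: one bit per kind of sensitive data.
TAINT_CLEAR = 0
TAINT_LOCATION = 1 << 0
TAINT_PHONE_NUMBER = 1 << 1
TAINT_DEVICE_ID = 1 << 2

MASK_32 = 0xFFFFFFFF  # keep tags within 32 bits, i.e. at most 32 markings


def combine(tag_a: int, tag_b: int) -> int:
    """Merge two taint tags: the result carries every marking of both."""
    return (tag_a | tag_b) & MASK_32


def has_marking(tag: int, marking: int) -> bool:
    """Check whether a tag carries a given marking."""
    return (tag & marking) != 0


tag = combine(TAINT_LOCATION, TAINT_DEVICE_ID)
print(has_marking(tag, TAINT_LOCATION), has_marking(tag, TAINT_PHONE_NUMBER))
```

Merging tags is then a single bitwise OR, which is one reason such a representation keeps per-operation overhead small.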

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2-20 times slower than normal, which is not suitable for realtime analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[REF] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, the higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This partially is because the applications being tested are loaded onto the phone as black-box, precompiled binaries; but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leak stemming from implicit data flow, but dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.
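The bypass described above can be demonstrated with a small Python sketch. It assumes a tracker that propagates taint only through explicit data flow (assignments and operations on tainted values); the names <code>Tracked</code> and <code>leak_via_branch</code> are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A value with a taint flag that follows explicit data flow only."""
    value: int
    tainted: bool = False


def leak_via_branch(secret: Tracked) -> Tracked:
    """Rebuild the secret bit by bit using only untainted constants."""
    copy = Tracked(0)  # untainted starting value
    for bit in range(8):
        # The branch condition depends on the secret, but a tracker that
        # ignores control flow never records that dependence.
        if secret.value & (1 << bit):
            # Only the untainted copy and a constant are combined here.
            copy = Tracked(copy.value | (1 << bit), copy.tainted)
    return copy


secret = Tracked(0b10110101, tainted=True)
leaked = leak_via_branch(secret)
print(leaked.value == secret.value, leaked.tainted)
```

The copy ends up holding the same value as the secret while carrying no taint, so a check at the network sink would let it pass.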

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely because the overheads (in this case a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider instead the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All that TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, and TaintDroid then returns a verdict on whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads are much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., ROSENBLUM, M., [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html Understanding Data Lifetime via Whole System Simulation] USENIX Security '04

[5] CEARA, D., POTET, ML., et.al [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis] GINP ENSIMAG GoogleCode(2009)

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:34:47Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form its basis have to be understood. The following two concepts are central to understanding it. 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique that computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis because the static analysis does not have access to any additional run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Each variable carries a taint value from V = {T, U}, where T is tainted and U is untainted. 
Using <big>⊕</big>: V x V -> V, with x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 

''Note: The paper discusses dynamic taint analysis. TaintDroid makes ingenious use of "taint" by tainting variables that are of value and tracking their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data. (In case this causes confusion.) ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is seen leaving through the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead is only in regard to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], TaintDroid uses the virtualized architecture of Android to propagate taint at four levels: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as Panorama, rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, variable-level tracking provides the flow semantics needed for taint propagation, allowing TaintDroid to distinguish between different data pointers at different levels and so preserve accuracy.

By combining these four levels (variable, method, message, and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[REF] It also determined that 50% of the applications sent the user's location to advertising servers and that 5 of the applications transmitted the user's device ID, phone number, and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

The challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to get around them, treating the targeted sensitive information as a taint source and using a taint marking to identify the type of information. It is easy to see that this research was effective, given the impressive number of information leaks that were found: TaintDroid identifies a high percentage of information misuse. However, while the implementation is strong in that overhead is low and accuracy is high, trade-offs were made to keep the overhead that low.

To avoid additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Dalvik VM (Android's Java virtual machine) does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: most string objects have the same tag, and because of these shared tags it is possible for false positives to occur.
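
To make the implicit-flow bypass concrete, here is a minimal sketch of a tracker that propagates taint only through data operations, which is the limitation described above. The <code>Tracked</code> class and both copy functions are hypothetical illustrations, not TaintDroid code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A value paired with a taint flag (hypothetical helper)."""
    value: int
    tainted: bool = False

    def __add__(self, other: "Tracked") -> "Tracked":
        # Explicit flow: the result is tainted if either operand is.
        return Tracked(self.value + other.value, self.tainted or other.tainted)


def copy_explicit(secret: Tracked) -> Tracked:
    """Data flows directly from the secret, so the taint follows it."""
    return secret + Tracked(0)


def copy_implicit(secret: Tracked, bits: int = 8) -> Tracked:
    """Rebuild the secret one bit at a time using only branches.

    Every value added here is an untainted constant, so a tracker that
    ignores control flow never marks the result.
    """
    leaked = Tracked(0)
    for i in range(bits):
        if secret.value & (1 << i):            # the branch depends on the secret
            leaked = leaked + Tracked(1 << i)  # but only constants are added
    return leaked


secret = Tracked(0b10110010, tainted=True)
explicit = copy_explicit(secret)
implicit = copy_implicit(secret)
print(explicit)  # Tracked(value=178, tainted=True)
print(implicit)  # Tracked(value=178, tainted=False)
```

Both copies hold the same value, but only the explicit one is flagged; catching the second would require tainting everything assigned under a branch whose condition is tainted.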

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes. That is highly unlikely, as the overheads (in this case a memory overhead of 4.4%, an IPC overhead of 27%, and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Compare this with the implementation of TaintCheck on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary and which then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the virtual machine layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG GoogleCode (2009)

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145'' (2009)

[10] CHUNG LAM, L., CHIUEH, T., [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.1478&rep=rep1&type=pdf A General Dynamic Information Flow Tracking Framework for Security Applications]. ''Proceedings of the Annual Computer Security Applications Conference (ACSAC)'' (2006).

[11] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:25:46Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form the basis of its theory have to be understood. The following two concepts are central to understanding it. 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information flow theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis is a technique for computing an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because a static analysis does not have access to any run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>For all variables with labels in V = {T, U}, where T is tainted and U is untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
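
The operator above can be written out directly. The following sketch enumerates all four combinations of labels; the names <code>Label</code> and <code>join</code> are chosen here for illustration and do not come from the cited papers.

```python
from enum import Enum
from itertools import product


class Label(Enum):
    U = "untainted"
    T = "tainted"


def join(x: Label, y: Label) -> Label:
    """x (+) y: tainted if either operand is tainted, untainted otherwise."""
    return Label.T if Label.T in (x, y) else Label.U


# Enumerate the full table for the two-element label set V = {T, U}.
table = {(x, y): join(x, y) for x, y in product(Label, repeat=2)}
for (x, y), result in table.items():
    print(f"{x.name} (+) {y.name} = {result.name}")
```

Only the U, U case stays untainted, which is why taint can spread but never disappears under this operator.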

''Note: The paper is about dynamic taint analysis. In taint analysis as usually described, "taint" marks untrusted information. TaintDroid instead makes ingenious use of taint to mark variables that are of value and track their progress, so in this case the tainted variables are in fact important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if we ever see marked data leave through the network interface of the phone, then we know that some sensitive information has been disseminated.
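
As a rough illustration of marking data at a source and checking it at the network sink, consider the sketch below. Everything in it is hypothetical: <code>Tagged</code>, <code>source</code>, <code>concat</code>, <code>network_send</code>, the host name, and the coordinates are stand-ins and not part of TaintDroid or Android.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """A string together with the set of taint labels it carries."""
    value: str
    labels: frozenset = frozenset()


def source(value: str, label: str) -> Tagged:
    """Taint source: mark data where it enters the program."""
    return Tagged(value, frozenset({label}))


def concat(*parts: Tagged) -> Tagged:
    """Propagation: the result carries the labels of every part."""
    text = "".join(p.value for p in parts)
    labels = frozenset().union(*(p.labels for p in parts))
    return Tagged(text, labels)


alerts = []


def network_send(host: str, payload: Tagged) -> None:
    """Taint sink: record an alert if labelled data is about to leave."""
    if payload.labels:
        alerts.append((host, sorted(payload.labels)))
    # A real sink would transmit payload.value here.


location = source("45.38,-75.70", "location")
request = concat(Tagged("GET /ads?loc="), location)
network_send("ads.example.com", request)             # carries the location label
network_send("ads.example.com", Tagged("GET /ping"))  # carries no labels
print(alerts)  # [('ads.example.com', ['location'])]
```

The alert fires even though the location was combined with other text before being sent, which is the property the authors need for realistic applications.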

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on a real, resource-constrained device with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead was measured on a "CPU-bound micro-benchmark", and that TaintDroid "imposes negligible overhead on interactive third-party applications" (Enck et al., YEAR, p1).

This low overhead is achieved by modifying the code directly at the virtual machine layer of the Android system (the Dalvik VM, which runs Android's Java applications) to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, message-level tracking allows them to monitor inter-process, a.k.a. inter-application, communications. For native code reached through the Java Native Interface (JNI), method-level tracking allows them to "patch the taint propagation on return" (Enck et al., YEAR, p3), so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3).[REF]
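
The trade-off between variable-level and message-level tracking can be sketched as follows. This assumes, for illustration only, that a message carries a single tag equal to the union of the tags of its fields; the tag constants and the <code>pack_message</code> and <code>unpack_message</code> helpers are hypothetical.

```python
from typing import Dict, Tuple

Tag = int
LOCATION: Tag = 0b01  # hypothetical marking bits
CONTACTS: Tag = 0b10


def pack_message(fields: Dict[str, Tuple[str, Tag]]) -> Tuple[Dict[str, str], Tag]:
    """Sender side: strip per-field tags and keep one tag for the whole message."""
    message_tag = 0
    for _, tag in fields.values():
        message_tag |= tag
    payload = {name: value for name, (value, _) in fields.items()}
    return payload, message_tag


def unpack_message(payload: Dict[str, str], message_tag: Tag) -> Dict[str, Tuple[str, Tag]]:
    """Receiver side: every field inherits the message's single tag."""
    return {name: (value, message_tag) for name, value in payload.items()}


sent = {"greeting": ("hello", 0), "where": ("45.38,-75.70", LOCATION)}
payload, tag = pack_message(sent)
received = unpack_message(payload, tag)
print(received["greeting"])  # ('hello', 1)
```

One tag per message is cheap to store and send, but the harmless greeting now carries the location marking on the receiving side, a small example of how coarser granularity can produce false positives.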

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], TaintDroid uses the virtualized architecture of Android to propagate taint at four levels: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as Panorama, rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system run 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, variable-level tracking provides the flow semantics needed for taint propagation, allowing TaintDroid to distinguish between different data pointers at different levels and so preserve accuracy.

By combining these four levels (variable, method, message, and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[REF] It also determined that 50% of the applications sent the user's location to advertising servers and that 5 of the applications transmitted the user's device ID, phone number, and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

The challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to get around them, treating the targeted sensitive information as a taint source and using a taint marking to identify the type of information. It is easy to see that this research was effective, given the impressive number of information leaks that were found: TaintDroid identifies a high percentage of information misuse. However, while the implementation is strong in that overhead is low and accuracy is high, trade-offs were made to keep the overhead that low.

To avoid additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Dalvik VM (Android's Java virtual machine) does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: most string objects have the same tag, and because of these shared tags it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes. That is highly unlikely, as the overheads (in this case a memory overhead of 4.4%, an IPC overhead of 27%, and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Compare this with the implementation of TaintCheck on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary and which then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the virtual machine layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04

[5] CEARA, D., POTET, ML., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG GoogleCode (2009)

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145'' (2009)

[10] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:25:25Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form the basis of its theory have to be understood. The following two concepts are central to understanding it. 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can occur between two processes or within a single process, for example between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information flow theory tries to capture this flow in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
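
The explicit flow above can also be sketched in runnable form. This is a minimal illustration, assuming a two-level ordering PUBLIC < PRIVATE; the names <code>Var</code>, <code>is_leak</code> and <code>assign</code> are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# Assumed two-level security ordering: PUBLIC < PRIVATE.
LEVELS = {"PUBLIC": 0, "PRIVATE": 1}


@dataclass
class Var:
    name: str
    level: str
    value: object = None


def is_leak(src: Var, dst: Var) -> bool:
    """An explicit flow src -> dst leaks when dst is less restricted than src."""
    return LEVELS[dst.level] < LEVELS[src.level]


def assign(dst: Var, src: Var) -> bool:
    """Copy the value and report whether the assignment was a leak."""
    dst.value = src.value
    return is_leak(src, dst)


secure = Var("secure", "PRIVATE", "555-0100")
notsecure = Var("notsecure", "PUBLIC")
leaked = assign(notsecure, secure)  # PRIVATE data now sits in a PUBLIC variable
```

Copying in the other direction (PUBLIC into PRIVATE) is not reported, because the destination is at least as restricted as the source.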
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value stored in 'secure' by observing which branch was taken, we can access the secure information indirectly. Information leaks due to implicit flows are much harder to detect and to protect against, because of their indirect nature.
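
To see why implicit flows are hard to catch, consider a tracker that propagates taint only through data operations. The sketch below is illustrative and is not TaintDroid's implementation; the class <code>Tainted</code> is a hypothetical stand-in for a value that carries a taint flag.

```python
class Tainted:
    """An int carrying a taint flag that follows explicit data flow only."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def __add__(self, other):
        other_value = other.value if isinstance(other, Tainted) else other
        other_taint = other.tainted if isinstance(other, Tainted) else False
        return Tainted(self.value + other_value, self.tainted or other_taint)


secret = Tainted(1, tainted=True)

explicit_copy = secret + 0        # data flow: the taint flag propagates
if secret.value == 1:             # control flow: the branch depends on secret
    implicit_copy = Tainted(1)    # a fresh constant, so no taint is attached
else:
    implicit_copy = Tainted(0)
```

After this runs, <code>implicit_copy</code> holds the same value as <code>secret</code> but is not marked as tainted, while <code>explicit_copy</code> is.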

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called dynamic taint analysis. The approach is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 
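
A minimal sketch of the run-time approach, assuming string data and a single "dangerous" operation. The names <code>TaintedStr</code>, <code>read_untrusted</code> and <code>run_command</code> are hypothetical; real systems track taint in memory at a much lower level.

```python
class TaintedStr(str):
    """A str subclass marking data that came from an untrusted source."""

    def __add__(self, other):
        # Concatenating with tainted data yields tainted data.
        return TaintedStr(str(self) + str(other))

    def __radd__(self, other):
        return TaintedStr(str(other) + str(self))


leak_log = []


def read_untrusted(text):
    """Hypothetical source: everything read here is labelled tainted."""
    return TaintedStr(text)


def run_command(command):
    """Hypothetical dangerous sink: log and refuse tainted commands."""
    if isinstance(command, TaintedStr):
        leak_log.append(command)
        return False
    return True


command = "ls " + read_untrusted("/tmp")  # taint survives the concatenation
allowed = run_command(command)
```

The extra <code>isinstance</code> check on every sink call is a small-scale picture of where the run-time slowdown comes from.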

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because a static analysis does not have access to any run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 
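
The over-approximation can be sketched as a fixed-point computation over the assignments in the program text. This is a toy model under the assumption that a program is just a list of (target, operands) pairs; the function name <code>static_taint</code> is hypothetical.

```python
def static_taint(assignments, sources):
    """Return every variable that may depend on a source, on any path.

    assignments: list of (target, operands) pairs, one per assignment in
    the program text, regardless of which branch it sits in.
    """
    tainted = set(sources)
    changed = True
    while changed:  # iterate until no new variable becomes tainted
        changed = False
        for target, operands in assignments:
            if target not in tainted and tainted & set(operands):
                tainted.add(target)
                changed = True
    return tainted


program = [
    ("a", ["user_input"]),  # a = user_input
    ("b", ["a", "c"]),      # b = a + c  (inside an `if` that may never run)
    ("d", ["c"]),           # d = c
]
result = static_taint(program, {"user_input"})
```

Here <code>b</code> is reported as tainted even if its branch never executes, which is the loss of accuracy compared with a dynamic analysis.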

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint labels, where T is tainted and U is untainted: 
Using <big>⊕</big>: V x V -> V, with x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
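
The operator ⊕ translates directly into code. The sketch below also shows one assumed generalisation: when several kinds of taint are tracked, each kind can be given one bit of an integer tag, and bitwise OR then plays the role of ⊕. The tag names are hypothetical.

```python
from functools import reduce

T, U = "T", "U"


def join(x, y):
    """x ⊕ y: tainted if either operand is tainted."""
    return T if T in (x, y) else U


def join_all(labels):
    """Label of a value computed from any number of operands."""
    return reduce(join, labels, U)


# Assumed generalisation: one bit per taint type packed into an int.
LOCATION, PHONE_NUMBER = 0b01, 0b10  # hypothetical tag names


def join_tags(x, y):
    """Bitwise OR keeps every marking carried by either operand."""
    return x | y
```

For example, <code>join(T, U)</code> gives <code>T</code>, and <code>join_tags(LOCATION, PHONE_NUMBER)</code> gives a tag with both bits set.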

''Note: The paper discusses dynamic taint analysis. In classic taint analysis, "taint" marks untrusted information; TaintDroid instead taints variables that hold valuable private data and tracks their progress through the system. ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving through the network interface of the phone, then we know that some sensitive information has been disseminated.
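
The mark-at-source, check-at-sink idea can be sketched as follows. All names here (<code>Data</code>, <code>get_location</code>, <code>combine</code>, <code>network_send</code>) and the host name are hypothetical illustrations, not TaintDroid's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Data:
    payload: str
    tags: frozenset = frozenset()


def get_location():
    """Hypothetical taint source: location data is marked where it is read."""
    return Data("45.38,-75.70", frozenset({"location"}))


def combine(*parts):
    """Derived data carries the union of its inputs' tags."""
    payload = "&".join(p.payload for p in parts)
    tags = frozenset().union(*(p.tags for p in parts))
    return Data(payload, tags)


flagged = []


def network_send(host, data):
    """Hypothetical taint sink: record any tagged data leaving the device."""
    if data.tags:
        flagged.append((host, sorted(data.tags)))


message = combine(Data("app=wallpaper"), get_location())
network_send("ads.example.com", message)
```

The mark survives the combination step, so the send is flagged even though the location was not transmitted on its own.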

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)

This low overhead is achieved by modifying the interpreter of Android's Dalvik virtual machine, the Java-style VM that runs applications, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they instrumented inter-process (a.k.a. inter-application) communication to provide message-level tracking, and wrapped native methods reached through the Java Native Interface (JNI) so that they can "patch the taint propagation on return." (Enck et al., YEAR, p3) and keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).[REF]
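
File-level tracking can be pictured with a small sketch. It assumes a single taint tag per file, stored as a bit vector and updated on every write; the class <code>TaintedFile</code> and the marking name are hypothetical.

```python
class TaintedFile:
    """One taint tag per file, updated on write and propagated on read."""

    def __init__(self):
        self.chunks = []
        self.tag = 0  # bit vector of taint markings; 0 means untainted

    def write(self, text, tag=0):
        self.chunks.append(text)
        self.tag |= tag  # conservative: the file keeps every marking written

    def read(self):
        # Every read returns the file-wide tag, whichever bytes are read.
        return "".join(self.chunks), self.tag


LOCATION = 0b1  # hypothetical marking

log_file = TaintedFile()
log_file.write("hello ")
log_file.write("45.38", LOCATION)
text, tag = log_file.read()
```

Keeping one tag per file is cheap, but it is conservative: data read back from the file is marked even if the bytes actually read were never sensitive.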

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, distinguishing between data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 37 were clearly legitimate uses.[REF] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, this finer granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely: the overheads (4.4% memory, 27% IPC and 14% overall) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead whenever the user uses the phone. Now consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, 'TaintCheck' performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary and which then reports whether the application is safe or not. This has the advantage of needing to run just once before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software] Proceedings of the Network and Distributed System Security Symposium (NDSS 2005)

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., ROSENBLUM, M., [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html Understanding Data Lifetime via Whole System Simulation] USENIX Security '04

[5] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis] GINP ENSIMAG GoogleCode (2009)

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement.] International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

[9] ZHU, Y., JUNG, J., KOHNO, T., WETHERALL, D., [http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks]. ''Technical Report No. UCB/EECS-2009-145'' (2009)

[] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:23:58Z

Sliske: /* Research problem */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding it: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can occur between two processes or within a single process, for example between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information flow theory tries to capture this flow in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value stored in 'secure' by observing which branch was taken, we can access the secure information indirectly. Information leaks due to implicit flows are much harder to detect and to protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called dynamic taint analysis. The approach is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because a static analysis does not have access to any run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint labels, where T is tainted and U is untainted: 
Using <big>⊕</big>: V x V -> V, with x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 

''Note: The paper discusses dynamic taint analysis. In classic taint analysis, "taint" marks untrusted information; TaintDroid instead taints variables that hold valuable private data and tracks their progress through the system. ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving through the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.pdf 9]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)

This low overhead is achieved by modifying the interpreter of Android's Dalvik virtual machine, the Java-style VM that runs applications, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they instrumented inter-process (a.k.a. inter-application) communication to provide message-level tracking, and wrapped native methods reached through the Java Native Interface (JNI) so that they can "patch the taint propagation on return." (Enck et al., YEAR, p3) and keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, distinguishing between data pointers at different levels to ensure accuracy.

By combining these four levels of taint tracking (variable, method, message, and file), TaintDroid was able to track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information being transmitted. Of these 105, 37 were judged to be clearly legitimate uses, leaving 68 instances of potential misuse.[REF] It also found that 50% of the applications sent the user's location to advertising servers and that 5 of the applications transmitted the user's device ID, phone number, and SIM card serial number. Clearly, finer-grained visibility into how applications use private data is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to address them, treating each piece of targeted sensitive information as a taint source and using a taint marking to identify the type of information. The effectiveness of this research is evident from the large number of information leaks that were found: TaintDroid flagged potential misuse of private information in a large proportion of the applications tested. However, while the implementation is strong in that overhead is low and accuracy is high, trade-offs were made to keep the overhead that low.

To avoid additional overhead, TaintDroid does not track implicit data flow, that is, control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the bytecode executed by Android's Dalvik VM does not maintain branch structures that TaintDroid could use to track implicit flow dynamically. A static analysis, which examines every possible execution path, could uncover data leaks stemming from implicit flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: to save space, an array stores a single taint tag for all of its elements, which suits string objects because all of their characters have the same tag. Because one tag covers the whole array, it is possible for false positives to occur.
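
One way to see how a shared tag can produce a false positive is the following Python sketch. It assumes a container that keeps one tag for all of its elements; the `TaggedArray` class and the `PHONE_NUMBER` marking are hypothetical and are not taken from TaintDroid's code.

```python
class TaggedArray:
    """An array that stores a single taint tag for all of its elements."""

    def __init__(self, size):
        self.items = [None] * size
        self.tag = 0  # 0 means untainted

    def store(self, index, value, tag):
        self.items[index] = value
        self.tag |= tag  # one tainted element marks the whole array

    def load(self, index):
        # The element's own history is unknown; the array tag is returned.
        return self.items[index], self.tag


PHONE_NUMBER = 0b10  # hypothetical marking

arr = TaggedArray(2)
arr.store(0, "public greeting", 0)
arr.store(1, "555-0100", PHONE_NUMBER)

value, tag = arr.load(0)
print(value, tag != 0)  # public greeting True  (a false positive)
```

Element 0 never held private data, yet it is reported as tainted because it shares a tag with element 1. Storing one tag per element would avoid this at the cost of more memory.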

Furthermore, TaintDroid is a firmware modification, not an application, which raises questions about its usability by the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely because the overheads (in this case a memory overhead of 4.4%, an IPC overhead of 27%, and a CPU overhead of 14% on a CPU-bound micro-benchmark) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Compare this with the implementation of TaintCheck on the x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All that TaintCheck needs is the binary, which it then rewrites and runs in its emulated environment. In essence, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary and which reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section in which to find more information pertaining to each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine used by Android to run user applications, which are written in Java. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the virtual machine layer, where TaintDroid is implemented. Instead, control structures are part of the precompiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005).

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04.

[5] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009).

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

[] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T19:20:21Z

Sliske: /* Static Taint Analysis */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can take place between two processes or within a given process, for example between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information flow theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the path the program takes, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
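
The difference between the two kinds of flow can be made concrete with a small Python sketch. The `Labeled` class and its `tainted` flag are hypothetical, introduced only for illustration; they model a tracker that propagates labels along assignments and nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Labeled:
    """A value paired with a label saying whether it derives from private data."""
    value: object
    tainted: bool = False


def explicit_copy(secure: Labeled) -> Labeled:
    # Explicit flow: the value is copied directly, so the label travels with it.
    return Labeled(secure.value, secure.tainted)


def implicit_copy(secure: Labeled, guess: str) -> Labeled:
    # Implicit flow: only constants are assigned, but which constant is chosen
    # depends on the secret. A tracker that follows assignments alone sees
    # nothing tainted being copied and leaves the result unlabeled.
    if secure.value == guess:
        return Labeled(1)
    return Labeled(0)


secret = Labeled("blah blah", tainted=True)

leaked_directly = explicit_copy(secret)
leaked_by_branch = implicit_copy(secret, "blah blah")

print(leaked_directly.tainted)   # True: the explicit flow is caught
print(leaked_by_branch.tainted)  # False: the implicit flow goes unnoticed
print(leaked_by_branch.value)    # 1, which reveals that the guess was right
```

Repeating the guess with different candidate values would let the public side reconstruct the secret without a single tainted assignment ever taking place.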

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because a static analysis does not have access to any run-time information about the program. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of labels, where T means tainted and U means untainted: 
Using <big>⊕</big>: V x V -> V, with x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
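
The operator above translates almost directly into code. The following is a minimal Python sketch of it; the names `join` and `join_all` are hypothetical, chosen here for illustration.

```python
from functools import reduce

T = "T"  # tainted
U = "U"  # untainted


def join(x: str, y: str) -> str:
    """The combining operator: tainted if either operand is tainted."""
    return T if T in (x, y) else U


def join_all(labels):
    """Label of a value computed from any number of operands."""
    return reduce(join, labels, U)


# Full truth table of the operator.
for x in (T, U):
    for y in (T, U):
        print(x, "+", y, "=", join(x, y))

print(join_all([U, U, T, U]))  # T: one tainted operand is enough
```

Because the result is untainted only when every operand is untainted, taint can spread through a computation but is never removed by combining values.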

''Note: The paper talks about dynamic taint analysis. TaintDroid makes ingenious use of "taint" to mark variables that hold valuable data and to track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the tainted variables hold important private data.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if we ever see marked data leave through the network interface of the phone, then we know that some sensitive information has been disseminated.
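
The mark, follow, and check steps can be sketched as a toy interpreter in Python. This is an illustration only: the four-instruction "program" format, the `run` function, and the `gps` source are all hypothetical and far simpler than a real virtual machine.

```python
def run(program, sources):
    """Interpret a tiny program while tracking taint alongside each variable.

    program: list of tuples, one of
        ("read", dst, source_name)  load a value from a named source
        ("const", dst, value)       load a constant
        ("add", dst, a, b)          dst = a + b
        ("send", src)               transmit a variable over the network
    sources: dict mapping source_name -> (value, is_sensitive)
    Returns the list of variables that were tainted when sent.
    """
    values, taint, alerts = {}, {}, []
    for op, *args in program:
        if op == "read":
            dst, name = args
            values[dst], taint[dst] = sources[name]   # taint at the source
        elif op == "const":
            dst, value = args
            values[dst], taint[dst] = value, False
        elif op == "add":
            dst, a, b = args
            values[dst] = values[a] + values[b]
            taint[dst] = taint[a] or taint[b]         # propagate the mark
        elif op == "send":
            (src,) = args
            if taint[src]:                            # check at the sink
                alerts.append(src)
    return alerts


program = [
    ("read", "loc", "gps"),
    ("const", "prefix", "pos="),
    ("add", "msg", "prefix", "loc"),
    ("const", "ping", "hello"),
    ("send", "msg"),
    ("send", "ping"),
]
print(run(program, {"gps": ("45.4,-75.7", True)}))  # ['msg']
```

Only `msg` is reported, because it was built from the location reading; `ping` never touched sensitive data and passes through the sink silently.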

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs only about 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that TaintDroid incurs this 14% CPU overhead only on a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)
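
One compact way to hold 32 independent taint markings per data unit is a 32-bit vector with one bit per marking. The Python sketch below assumes that encoding; the marking names in `Taint` are hypothetical examples, not the list used by TaintDroid.

```python
from enum import IntFlag


class Taint(IntFlag):
    """Hypothetical taint markings; each occupies one bit of a 32-bit tag."""
    CLEAR = 0
    LOCATION = 1 << 0
    PHONE_NUMBER = 1 << 1
    DEVICE_ID = 1 << 2
    SIM_SERIAL = 1 << 3
    # ... up to bit 31, for 32 distinct markings in total


MAX_TAG = (1 << 32) - 1


def combine(a: Taint, b: Taint) -> Taint:
    """Tag of a value derived from two operands: the union of their markings."""
    return a | b


tag = combine(Taint.LOCATION, Taint.DEVICE_ID)
print(Taint.LOCATION in tag)      # True
print(Taint.PHONE_NUMBER in tag)  # False
print(int(tag) <= MAX_TAG)        # True: the tag fits in 32 bits
```

With this encoding, merging the markings of two operands is a single bitwise OR, which is part of why per-variable tags can be kept cheap.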

This low overhead is achieved by instrumenting Android's virtual machine interpreter (the Dalvik VM, Android's counterpart to the Java Virtual Machine) to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they added message-level tracking, which allows them to monitor inter-process (that is, inter-application) communication. For native code reached through the Java Native Interface (JNI), they use method-level tracking: the native code runs uninstrumented, and they "patch the taint propagation on return" (Enck et al., YEAR, p3) so they can keep track of information transferred via native code. [REF] Finally, by modifying the network and secondary storage interfaces they provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], TaintDroid exploits the virtualized architecture of Android to propagate taint at four levels: variable, method, message, and file. The granularity and flow semantics chosen at each level strongly influence both performance and accuracy. Existing taint tracking approaches, such as Panorama, rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system run 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid avoids this problem by combining its four levels of tracking. For example, variable-level tracking lets TaintDroid define flow semantics for taint propagation and distinguish between data and pointers, which helps preserve accuracy.

By combining these four levels of taint tracking (variable, method, message, and file), TaintDroid was able to track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information being transmitted. Of these 105, 37 were judged to be clearly legitimate uses, leaving 68 instances of potential misuse.[REF] It also found that 50% of the applications sent the user's location to advertising servers and that 5 of the applications transmitted the user's device ID, phone number, and SIM card serial number. Clearly, finer-grained visibility into how applications use private data is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to address them, treating each piece of targeted sensitive information as a taint source and using a taint marking to identify the type of information. The effectiveness of this research is evident from the large number of information leaks that were found: TaintDroid flagged potential misuse of private information in a large proportion of the applications tested. However, while the implementation is strong in that overhead is low and accuracy is high, trade-offs were made to keep the overhead that low.

To avoid additional overhead, TaintDroid does not track implicit data flow, that is, control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the bytecode executed by Android's Dalvik VM does not maintain branch structures that TaintDroid could use to track implicit flow dynamically. A static analysis, which examines every possible execution path, could uncover data leaks stemming from implicit flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: to save space, an array stores a single taint tag for all of its elements, which suits string objects because all of their characters have the same tag. Because one tag covers the whole array, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises questions about its usability by the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely because the overheads (in this case a memory overhead of 4.4%, an IPC overhead of 27%, and a CPU overhead of 14% on a CPU-bound micro-benchmark) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Compare this with the implementation of TaintCheck on the x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All that TaintCheck needs is the binary, which it then rewrites and runs in its emulated environment. In essence, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary and which reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section in which to find more information pertaining to each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine used by Android to run user applications, which are written in Java. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the virtual machine layer, where TaintDroid is implemented. Instead, control structures are part of the precompiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005).

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04.

[5] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009).

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

[] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:16:16Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can take place between two processes or within a given process, for example between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information flow theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the path the program takes, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach is to label data originating from untrusted sources as tainted. The analysis keeps track of all tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 
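
A minimal sketch of the three steps described above (label at the source, propagate on assignment, check at the sink) might look as follows. The <code>TaintTracker</code> class, its method names and the variable names are assumptions made for illustration; a real tracker instruments the running program rather than being called by hand.

```python
class TaintTracker:
    """Toy run-time tracker that records which variable names are tainted."""

    def __init__(self):
        self.tainted = set()
        self.log = []

    def source(self, name):
        # Data arriving from an untrusted source is labelled on entry.
        self.tainted.add(name)

    def assign(self, target, *operands):
        # The target is tainted if and only if some operand is tainted.
        if any(op in self.tainted for op in operands):
            self.tainted.add(target)
        else:
            self.tainted.discard(target)

    def sink(self, sink_name, name):
        # Using tainted data in a sensitive operation is logged.
        if name in self.tainted:
            self.log.append((sink_name, name))


tracker = TaintTracker()
tracker.source("user_input")
tracker.assign("query", "prefix", "user_input")  # tainted via user_input
tracker.assign("banner", "prefix")               # stays clean
tracker.sink("exec", "query")
tracker.sink("exec", "banner")
```

Every assignment and every sink call passes through the tracker, which is where the run-time slowdown mentioned above comes from.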

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because a static analysis does not have access to run-time information about the program. [REF] 
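
The over-approximation can be illustrated with a small flow-insensitive fixpoint computation. This is a sketch under simplifying assumptions, not the algorithm from the cited work: the program is reduced to a hypothetical list of (target, operands) assignments and statement order is ignored, so a variable is reported if it could be tainted on any path.

```python
def static_taint(statements, sources):
    """Return every variable that may be tainted on some execution path.

    statements: list of (target, [operands]) pairs; order is ignored.
    sources: names that are tainted from the start.
    """
    tainted = set(sources)
    changed = True
    while changed:  # iterate until no new variable is added
        changed = False
        for target, operands in statements:
            if target not in tainted and any(op in tainted for op in operands):
                tainted.add(target)
                changed = True
    return tainted


program = [
    ("a", ["user_input"]),
    ("b", ["a", "const"]),
    ("c", ["const"]),
    ("d", ["c"]),
    ("c", ["b"]),  # c is overwritten with tainted data only later
]
result = static_taint(program, {"user_input"})
```

In a straight-line run of this program <code>d</code> would be assigned before <code>c</code> becomes tainted, but the static result still includes <code>d</code>. That is the loss of accuracy the paragraph above refers to.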

===Mathematical Model===
<big><code>Let V = {T, U} be the set of labels, where T is tainted and U is untainted: 
<big>⊕</big>: V x V -> V, with x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used to compute another variable, the result becomes tainted as well; the taint is propagated. Taking this further, we can mark variables as tainted by attaching a taint tag to them, which can then be tracked or used as needed. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
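
The operator can be written out directly as a sanity check. The sketch below is illustrative only; the names <code>combine</code> and <code>taint_of</code> are assumptions. It folds ⊕ over any number of operands, starting from U, so a single tainted operand is enough to taint the result.

```python
from functools import reduce

T, U = "T", "U"  # tainted / untainted labels


def combine(x, y):
    """The ⊕ operator: T if either argument is T, otherwise U."""
    return T if T in (x, y) else U


def taint_of(*operand_labels):
    """Label of a result computed from the given operands."""
    return reduce(combine, operand_labels, U)
```

For example, <code>taint_of(U, U, T, U)</code> evaluates to <code>T</code>, while <code>taint_of(U, U)</code> stays <code>U</code>.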

''Note: The paper is about dynamic taint analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid instead taints variables that hold valuable private data and tracks where they go, so in this case the tainted variables are in fact important private data rather than untrusted input.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the phone through the network interface, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% figure was measured on a "CPU-bound micro-benchmark", and that TaintDroid "imposes negligible overhead on interactive third-party applications." (Enck et al., 2010, p1)
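
One natural way to hold 32 taint markings per data unit is a 32-bit vector with one bit per marking, so that combining two tags is a single bitwise OR. The sketch below assumes that encoding. The marking names and bit positions are hypothetical and are not taken from the TaintDroid source.

```python
# Hypothetical marking assignments: one bit per kind of sensitive data.
LOCATION = 1 << 0
PHONE_NUMBER = 1 << 1
DEVICE_ID = 1 << 2

MASK_32 = 0xFFFFFFFF  # keep tags within 32 bits

NAMES = {LOCATION: "location", PHONE_NUMBER: "phone number", DEVICE_ID: "device id"}


def merge(tag_a, tag_b):
    """Tag of a value derived from two operands: the union of their markings."""
    return (tag_a | tag_b) & MASK_32


def markings(tag):
    """List the marking names set in a tag, in bit order."""
    return [name for bit, name in NAMES.items() if tag & bit]


payload_tag = merge(LOCATION, DEVICE_ID)
```

Because a merge is one OR on a machine word, the per-operation cost stays small, which fits the low overhead figures quoted above.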

This low overhead is achieved by modifying the Java Virtual Machine (JVM) layer of the Android system directly to provide variable-level tracking. This gives direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the Java Native Interface (JNI) to provide message-level tracking, which allows them to monitor inter-process (that is, inter-application) communications. This also allows them to "patch the taint propagation on return" (Enck et al., 2010, p3) so they can keep track of information transferred via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces, they are able to provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., 2010, p3).[REF]
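
The coarser levels trade precision for speed. The following sketch assumes that a message or a file carries a single tag equal to the union of the tags of everything placed in it. That is one way to read "conservatively retains its taint markings", and the <code>message_tag</code> function and <code>TaggedFile</code> class are hypothetical.

```python
def message_tag(variable_tags):
    """One tag for a whole message: the union of its variables' tags."""
    tag = 0
    for t in variable_tags:
        tag |= t
    return tag


class TaggedFile:
    """In-memory stand-in for a file that stores a single taint tag."""

    def __init__(self):
        self.chunks = []
        self.tag = 0

    def write(self, data, tag):
        self.chunks.append(data)
        self.tag |= tag  # the file tag only ever grows

    def read(self):
        # Everything read back carries the whole file's tag.
        return b"".join(self.chunks), self.tag


f = TaggedFile()
f.write(b"hello", 0)  # untainted data
f.write(b"gps", 1)    # tainted data
data, tag = f.read()
```

After the second write, even the untainted bytes are read back with the tag set. Under this assumption, coarse tags cannot miss a flow but may report data that was never sensitive.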

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system run 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were judged to be clearly legitimate uses, leaving 68 instances of potential misuse.[REF] It also determined that 50% of the applications sent the user's location to advertising servers, and that 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, finer-grained tracking is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid would need in order to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage, which arise because most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider instead the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary and which then reports whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005).

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04.

[5] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009).

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

[9] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:16:03Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form the basis of its theory have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly: the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure == "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Since we can determine the value of 'secure' by observing which branch was taken, we can access the secure information indirectly. Information leaks due to implicit flows are much harder to detect and prevent, because no secure value is ever copied directly.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach is to label data originating from untrusted sources as tainted. The analysis keeps track of all tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because a static analysis does not have access to run-time information about the program. [REF] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of labels, where T is tainted and U is untainted: 
<big>⊕</big>: V x V -> V, with x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used to compute another variable, the result becomes tainted as well; the taint is propagated. Taking this further, we can mark variables as tainted by attaching a taint tag to them, which can then be tracked or used as needed. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 

''Note: The paper is about dynamic taint analysis. In the usual context of taint analysis, "taint" marks untrusted information. TaintDroid instead taints variables that hold valuable private data and tracks where they go, so in this case the tainted variables are in fact important private data rather than untrusted input.''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the phone through the network interface, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% figure was measured on a "CPU-bound micro-benchmark", and that TaintDroid "imposes negligible overhead on interactive third-party applications." (Enck et al., 2010, p1)

This low overhead is achieved by modifying the Java Virtual Machine (JVM) layer of the Android system directly to provide variable-level tracking. This gives direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the Java Native Interface (JNI) to provide message-level tracking, which allows them to monitor inter-process (that is, inter-application) communications. This also allows them to "patch the taint propagation on return" (Enck et al., 2010, p3) so they can keep track of information transferred via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces, they are able to provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., 2010, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system run 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were judged to be clearly legitimate uses, leaving 68 instances of potential misuse.[REF] It also determined that 50% of the applications sent the user's location to advertising servers, and that 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, finer-grained tracking is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid would need in order to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage, which arise because most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises questions about its usability for the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall 14% overhead) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider instead the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, 'TaintCheck' performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision an application to which you upload the 'application binary' and which then returns a result indicating whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads are much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of 'malicious applications'.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to consult for more information on each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[2] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005).

[4] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04.

[5] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009).

[6] http://news.bbc.co.uk/2/hi/technology/8559683.stm

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

[]YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:15:54Z

Sliske: /* Research problem */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its theoretical basis have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
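
The same rule can be written as a small runnable check. This is only a sketch under stated assumptions: the `Level`, `Var` and `assign` names are hypothetical helpers invented for illustration, and the two-level ordering (PUBLIC below PRIVATE) is the simplest possible case of a security classification.

```python
from enum import IntEnum


class Level(IntEnum):
    """Security classes, ordered from least to most restricted."""
    PUBLIC = 0
    PRIVATE = 1


class Var:
    """A variable with a fixed security class (hypothetical helper)."""

    def __init__(self, name, level, value=None):
        self.name, self.level, self.value = name, level, value


def assign(dest, src):
    """Perform dest = src, refusing flows to a less restricted class."""
    if src.level > dest.level:
        raise PermissionError(f"explicit flow leak: {src.name} -> {dest.name}")
    dest.value = src.value


secure = Var("secure", Level.PRIVATE, "secret")
notsecure = Var("notsecure", Level.PUBLIC)

try:
    assign(notsecure, secure)   # PRIVATE -> PUBLIC is rejected
except PermissionError as err:
    print(err)
```

An assignment in the other direction (PUBLIC into PRIVATE) passes the check, since the destination is at least as restricted as the source.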
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. This leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
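
To see why a tracker that only follows assignments misses this, consider the following minimal Python sketch. The `Tainted` wrapper and the `leak_via_branch` function are hypothetical names introduced here for illustration; they do not correspond to any real tool. The wrapper stands in for a tracker that attaches a taint flag to values and would carry it across direct copies.

```python
class Tainted:
    """A value paired with a taint flag (hypothetical helper for illustration)."""

    def __init__(self, value, tainted=True):
        self.value = value
        self.tainted = tainted

    def __eq__(self, other):
        # The comparison yields a plain bool, so the taint flag is lost here.
        other_value = other.value if isinstance(other, Tainted) else other
        return self.value == other_value


def leak_via_branch(secure, guess):
    """Learn one bit about `secure` without ever assigning from it."""
    if secure == guess:      # control flow depends on the secret
        notsecure = 1        # a constant, which carries no taint
    else:
        notsecure = 0
    return notsecure


secret = Tainted("blah blah")
observed = leak_via_branch(secret, "blah blah")
print(observed, isinstance(observed, Tainted))
```

The returned integer is an ordinary untainted value, yet it reveals whether the guess matched the secret. Repeating the test with different guesses would recover the secret piece by piece.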

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because a static analysis does not have access to any run-time information about the program. [REF] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of labels, where T means tainted and U means untainted. 
Define <big>⊕</big>: V x V -> V for x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 
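
The operator can be expressed directly in a few lines of Python. The first function is a literal transcription of the two-label model. The second part is an assumed generalisation, not taken from the model itself: each taint type gets one bit, and tags are merged with a bitwise OR, so a value derived from several sources remembers all of them. The constant names are hypothetical.

```python
T, U = "T", "U"  # tainted / untainted labels from the model


def combine(x, y):
    """The ⊕ operator: the result is tainted if either operand is tainted."""
    return T if T in (x, y) else U


# Assumed generalisation: one bit per taint type (hypothetical names).
TAINT_LOCATION = 0x1
TAINT_PHONE_NUMBER = 0x2


def combine_tags(tag_x, tag_y):
    """Merge two bit-vector tags; a zero tag plays the role of U."""
    return tag_x | tag_y


print(combine(T, U), combine(U, U))
print(combine_tags(TAINT_LOCATION, TAINT_PHONE_NUMBER))
```

With a bit vector, the two-label case is recovered by asking whether the merged tag is non-zero.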

''Note: The paper uses Dynamic Taint Analysis. In the usual context of taint analysis, "taint" marks untrusted information; TaintDroid instead uses "taint" to mark variables that hold important private data and tracks their progress through the system (in case this causes confusion). ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm 6]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf 7] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if we ever see marked data leave the network interface of the phone, then we know that some sensitive information has been disseminated.
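
The source, propagation and sink steps described above can be sketched in Python as follows. This is an illustrative toy, not TaintDroid's implementation: the `Value` class, the tag constants, `read_location` and `network_send` are all hypothetical names, and the coordinate string is placeholder data.

```python
from dataclasses import dataclass

TAINT_CLEAR = 0
TAINT_LOCATION = 1 << 0   # hypothetical tag bits, one per kind of sensitive data
TAINT_PHONE = 1 << 1


@dataclass(frozen=True)
class Value:
    data: str
    tag: int = TAINT_CLEAR

    def __add__(self, other):
        # Propagation: the result carries the union of both operands' tags.
        return Value(self.data + other.data, self.tag | other.tag)


def read_location():
    """Taint source: mark the data at the point where it enters the system."""
    return Value("45.38,-75.69", TAINT_LOCATION)  # placeholder coordinates


def network_send(value, log):
    """Taint sink: record the tag of any marked data about to leave the device."""
    if value.tag != TAINT_CLEAR:
        log.append(value.tag)
    return value.data


log = []
payload = Value("loc=") + read_location()   # the mark survives concatenation
network_send(payload, log)
network_send(Value("hello"), log)           # untainted data is not logged
print(log)
```

Only the first send is logged, because the location tag followed the data through the concatenation while the plain string never carried a mark.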

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf 8]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf 3]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes only roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can lead to the system performing 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With some complex instructions, such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] However, TaintDroid addresses this problem by combining four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, the authors judged 37 to be clearly legitimate uses.[REF] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, this finer granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

The challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to work around them, treating the targeted sensitive information as a taint source and using a taint marking to identify the information type. It is easy to see that this research was effective, given the impressive number of information leaks that were found. TaintDroid flagged potential misuse in a large share of the applications studied. However, while the implementation is strong in that the overhead is low and the accuracy is high, trade-offs were made to keep the overhead that low.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which arise because most string objects share the same tag. Because of the shared tags, false positives can occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises questions about its usability for the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall 14% overhead) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider instead the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, 'TaintCheck' performs dynamic taint analysis by rewriting the binary at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision an application to which you upload the 'application binary' and which then returns a result indicating whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads are much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of 'malicious applications'.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to consult for more information on each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009).

[] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005).

[] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

[] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04.

[] HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:14:09Z

Sliske: /* Taint Analysis */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its theoretical basis have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information Flow Theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. This leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf 2] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 3] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec 4] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf 5] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because a static analysis does not have access to any run-time information about the program. [REF] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of labels, where T means tainted and U means untainted. 
Define <big>⊕</big>: V x V -> V for x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] 

''Note: The paper uses Dynamic Taint Analysis. In the usual context of taint analysis, "taint" marks untrusted information; TaintDroid instead uses "taint" to mark variables that hold important private data and tracks their progress through the system (in case this causes confusion). ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the phone through the network interface, then we know that some sensitive information has been disseminated.
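The mark-at-source, check-at-sink idea can be illustrated with a small Python sketch. Everything here is hypothetical: <code>Tainted</code>, <code>read_location</code>, <code>network_send</code> and the host name are stand-ins invented for the example, and TaintDroid itself does this inside the virtual machine rather than with a wrapper class.

```python
class Tainted:
    """A value paired with the set of taint markings it carries."""

    def __init__(self, value, marks=frozenset()):
        self.value = value
        self.marks = frozenset(marks)

    def __add__(self, other):
        # Combining two values unions their markings (taint propagation).
        if isinstance(other, Tainted):
            return Tainted(self.value + other.value, self.marks | other.marks)
        return Tainted(self.value + other, self.marks)

    def __radd__(self, other):
        # Handles "plain value + Tainted": the result keeps the markings.
        return Tainted(other + self.value, self.marks)


def read_location():
    """Hypothetical taint source: sensitive data is marked where it enters."""
    return Tainted("45.38,-75.69", {"LOCATION"})


leaks = []


def network_send(host, payload):
    """Hypothetical taint sink: record any marked data about to leave."""
    if isinstance(payload, Tainted) and payload.marks:
        leaks.append((host, sorted(payload.marks)))
    # The actual transmission is omitted in this sketch.


# The application combines the location with other data before sending it.
query = "http://ads.example/?loc=" + read_location()
network_send("ads.example", query)
network_send("ads.example", Tainted("hello"))  # unmarked data is not reported
```

Only the first call is recorded in <code>leaks</code>, because the marking survived the string concatenation.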

There are many difficulties associated with implementing such a system on a smartphone. The authors' design goals were to create a lightweight, real-time tracking system with minimal overhead that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead was measured on a "CPU-bound micro-benchmark", and that TaintDroid "imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)
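One compact way to hold 32 taint markings per data unit is a 32-bit vector with one bit per marking, so that combining two tags is a single bitwise OR. The sketch below assumes that encoding; the marking names and bit positions are hypothetical and chosen only for illustration.

```python
# Hypothetical marking names and bit positions (one bit per marking).
LOCATION = 1 << 0
PHONE_NUMBER = 1 << 1
DEVICE_ID = 1 << 2
CONTACTS = 1 << 3

TAG_MASK = 0xFFFFFFFF  # a tag is limited to 32 bits, i.e. 32 markings

NAMES = {
    LOCATION: "LOCATION",
    PHONE_NUMBER: "PHONE_NUMBER",
    DEVICE_ID: "DEVICE_ID",
    CONTACTS: "CONTACTS",
}


def combine(tag_a: int, tag_b: int) -> int:
    """Union of two taint tags: a bitwise OR, truncated to 32 bits."""
    return (tag_a | tag_b) & TAG_MASK


def has_marking(tag: int, marking: int) -> bool:
    """True if the tag carries the given marking."""
    return tag & marking != 0


def markings(tag: int) -> list:
    """Names of all known markings present in a tag."""
    return [name for bit, name in NAMES.items() if tag & bit]
```

With this encoding, data derived from both the location and the device ID carries the tag <code>combine(LOCATION, DEVICE_ID)</code>, and checking it at a sink costs one AND per marking of interest.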

This low overhead is achieved by instrumenting the interpreter of Android's Dalvik virtual machine (VM), the Java-compatible VM that runs application code, to provide variable-level tracking. This gives direct visibility into how private information, such as location details from the GPS, is stored and accessed. [REF] Next, the authors added message-level tracking to inter-process communication (IPC), which lets them follow taint across inter-process, a.k.a. inter-application, communication. For system-provided native libraries reached through the Java Native Interface (JNI), they use method-level tracking: the native code runs uninstrumented and they "patch the taint propagation on return" (Enck et al., YEAR, p3), so that information transferred via native code is still accounted for. [REF] Finally, by modifying the secondary storage interface they provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3), while the network interface is the point at which outgoing tainted data is detected.[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], TaintDroid takes advantage of the virtualized architecture of Android to combine four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics chosen at each level strongly influence both the performance and the accuracy of TaintDroid. Existing taint tracking approaches, such as Panorama, rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system perform 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem known as taint explosion: with complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid mitigates this problem by combining the four levels of tracking. For example, variable-level tracking gives TaintDroid flow semantics for taint propagation, which helps it distinguish data from pointers and keeps the tracking accurate.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 37 were clearly legitimate uses.[REF] It also determined that 50% of the applications sent the user's location to advertising servers, and that 7 of the applications transmitted the device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer-grained visibility into how applications use private data is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid tracks only explicit data flow; it does not track implicit flow, which arises through control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the bytecode executed by Android's Dalvik VM does not maintain branch structures, which TaintDroid would need in order to track implicit flow dynamically. A static analysis, which examines the control flow of the whole program, could uncover data leaks stemming from implicit flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: to save memory, an array or string object carries a single taint tag that is shared by all of its elements. Because of this coarse tagging, it is possible for false positives to occur.
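The false positives caused by coarse tag storage can be illustrated with a short Python sketch. It assumes a container that keeps one tag for all of its elements; <code>TaggedArray</code> and the <code>PHONE_NUMBER</code> value are hypothetical names used only for this example.

```python
PHONE_NUMBER = 0x1  # hypothetical taint marking


class TaggedArray:
    """An array that stores a single taint tag shared by every element."""

    def __init__(self, size):
        self.items = [None] * size
        self.tag = 0  # one tag for the whole array (the assumed policy)

    def store(self, index, value, tag=0):
        self.items[index] = value
        self.tag |= tag  # the shared tag can only accumulate markings

    def load(self, index):
        # Every element is reported with the array-wide tag.
        return self.items[index], self.tag


arr = TaggedArray(2)
arr.store(0, "weather=sunny")           # harmless, untainted value
arr.store(1, "555-0100", PHONE_NUMBER)  # sensitive value taints the array

value, tag = arr.load(0)
```

Here <code>value</code> is the harmless string, yet <code>tag</code> reports <code>PHONE_NUMBER</code>: sending element 0 over the network would be flagged even though no phone number leaves the device. The trade-off is memory: one tag per array instead of one per element.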

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and a CPU overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider instead the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, 'TaintCheck' is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the 'application binary', after which TaintDroid returns a verdict on whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would not be much of a concern. It could even allow TaintDroid to incorporate signature-based detection of 'malicious applications'.

=Questions=
Possible exam questions and brief answers are listed below, along with the section that contains more information on each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java-compatible virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the Dalvik VM layer, where TaintDroid is implemented. Instead, control structures are part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG GoogleCode (2009)

[] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005)

[] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

[] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04

[] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs)]. http://pskl.us

[] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:12:29Z

Sliske: /* Information Flow */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the ideas that form its basis have to be understood. The following two concepts are central to understanding it: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can take place between two processes or within a given process, for example between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Information flow theory tries to capture this flow of information in a mathematical model. 
In a security model, information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf 1] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure == "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
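To see why implicit flows are hard to catch, consider a tracker that propagates taint only through direct assignment. The Python sketch below is a toy model under that assumption; <code>Tracked</code>, <code>copy_explicit</code> and <code>copy_implicit</code> are hypothetical names, and the untainted label in <code>copy_implicit</code> is written by hand to show what such a tracker would conclude.

```python
class Tracked:
    """A value with a taint flag, as an explicit-flow-only tracker sees it."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted


def copy_explicit(secret: Tracked) -> Tracked:
    # Direct assignment: the taint flag travels with the value.
    return Tracked(secret.value, secret.tainted)


def copy_implicit(secret: Tracked) -> Tracked:
    # Rebuild an 8-bit secret one bit at a time using only branches.
    result = 0
    for bit in range(8):
        if secret.value & (1 << bit):  # branch depends on tainted data
            result |= 1 << bit         # but only a constant is assigned
    # A tracker that ignores control flow sees no tainted operand here.
    return Tracked(result, tainted=False)


secret = Tracked(0xA5, tainted=True)
direct = copy_explicit(secret)
leaked = copy_implicit(secret)
```

Both copies hold the same value as the secret, but only <code>direct</code> is still marked as tainted; <code>leaked</code> has lost its taint entirely.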

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically, by analyzing the program's source code rather than running it. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, it may be less accurate than a dynamic analysis, because it has no access to run-time information about the program. [REF] 

===Mathematical Model===
<big><code>For all variables V = {T,U} ;T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T; x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] 

''Note: The paper talks about Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" to taint variables that are of value and tracks their progress. Though in the actual context of Taint Analysis "taint" is used for untrusted information however in this case the "taint" variables are infact important private data. (Just in case if it confused someone) ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea being to mark (or ''taint'') sensitive information at its source, and to then follow that mark as the data moves through a system. In the context of this paper, if ever we should see marked data leave the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real constrained hardware devices with minimal overhead. TaintDroid only causes roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU over-head is only in regards to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is accuracy of tracking sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers highly influences the performance and accuracy of TaintDroid. Existing taint tracking approaches, like Panorama Taint System, rely on instruction-level dynamic taint analysis using whole system emulation. This method can lead to the system preforming from 2-20 times slower than normal, which is not suitable at all for the trend of realtime analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem, taint explosion. When we use some complex instructions such as CMPXCHG, REP MOV,the stack pointer may become falsely tainted or taint loss. [REF] However, TaintDroid solved this problem with the combination of 4 levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, 3rd party android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[REF] It also determined that 50% of the applications submitted the users location to advertising servers and 5 of the applications transmitted the users device ID, phone number and SIM card serial number. Clearly, the higher granularity is needed and TaintDroid is providing a step in the right direction, by providing a highly efficient real time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This partially is because the applications being tested are loaded onto the phone as black-box, precompiled binaries; but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leak stemming from implicit data flow, but dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Further more, TaintDroid is a firmware modification, not an application which raises the questions of its usability by the average user. Being a firmware modification drastically reduces its usability unless 'Android System' itself incorporates these changes which is highly unlikely as the overheads, in this case a memory overhead of 4.4%, an IPC overhead of 27% and an overall 14% overhead, are on the higher side in an already resource constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead as the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All the TaintCheck needs is the binary which it the rewrites and uses it in its own emulated environment. What this essentially means is that 'TaintCheck' is a mechanism that can perform dynamic taint analysis by performing binary rewriting at run time on an emulated envioronment. Taking this further, we can consider an implementation of TaintDroid based on similar lines. One can then envision an application in which you uploaded the 'application binary' and then TaintDroid would return a result of whether the application is safe or not. This has the advantage of being needed to run just once before installation and hence the overheads won't be much of a concern. This can even allow TaintDroid to incorporate signature based detection of 'malacious applications'.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvic VM is modified. Dalvic is the java virtual machine used by Android to run user applications. Although Dalvic is a core part of the Android operating system, it is not a part of the kernel. Dalvic runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvic, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009).

[] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005).

[] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

[] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04.

[] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:09:41Z

Sliske: /* Additional questions */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form their basis have to be understood. The following two concepts are central to understanding this paper:
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Depending on the path the program takes, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, due to the indirect nature of implicit flows.
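
The difference between the two kinds of flow can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the <code>Tainted</code> class and both helper functions are hypothetical names, and the "tracker" here simply copies a boolean flag on assignment.

```python
class Tainted:
    """A value paired with a boolean taint flag (hypothetical helper)."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted


def explicit_copy(src):
    # Explicit flow: the assignment carries the tag along with the data.
    return Tainted(src.value, src.tainted)


def implicit_copy(src, guess):
    # Implicit flow: the result is a fresh constant chosen by a branch on
    # the secret, so a tracker that only follows assignments sees no taint.
    if src.value == guess:
        return Tainted(1)
    return Tainted(0)


secure = Tainted("blah blah", tainted=True)
leaked_explicit = explicit_copy(secure)
leaked_implicit = implicit_copy(secure, "blah blah")
```

In this sketch <code>leaked_explicit</code> keeps its taint flag, while <code>leaked_implicit</code> reveals whether the guess was right yet carries no taint at all.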

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because static analysis does not have access to any run-time information about the program. [REF] 

===Mathematical Model===
<big><code>For all variables V = {T,U} ;T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] 
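
The propagation rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the names <code>Tag</code>, <code>combine</code> and <code>propagate</code>, and the variable names in the example, are hypothetical and do not come from the paper.

```python
from enum import Enum


class Tag(Enum):
    U = 0  # untainted
    T = 1  # tainted


def combine(x, y):
    """The operator from the model: T if either operand is T, else U."""
    return Tag.T if Tag.T in (x, y) else Tag.U


def propagate(tags, dest, *sources):
    """Record the tag of `dest` as the combination of its source tags."""
    result = Tag.U
    for name in sources:
        result = combine(result, tags.get(name, Tag.U))
    tags[dest] = result
    return result


tags = {"location": Tag.T, "offset": Tag.U}
propagate(tags, "shifted", "location", "offset")  # one tainted operand
propagate(tags, "doubled", "offset", "offset")    # both operands untainted
propagate(tags, "payload", "shifted", "doubled")  # taint carried forward
```

Here <code>shifted</code> and <code>payload</code> end up tainted because a tainted value appears somewhere in their history, while <code>doubled</code> stays untainted.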

''Note: The paper talks about Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" by tainting variables that are of value and tracking their progress. In the usual context of Taint Analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data. (Noted here in case the terminology causes confusion.) ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if we ever see marked data leave through the network interface of the phone, then we know that some sensitive information has been disseminated.
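
The source-to-sink idea can be sketched in Python. This is an illustration only, not TaintDroid's implementation: the <code>Labeled</code> wrapper, the <code>read_location</code> source, the <code>network_send</code> sink, the coordinates and the host name are all made up for the example.

```python
class Labeled:
    """A value carrying a set of taint labels (hypothetical helper)."""

    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)

    def __add__(self, other):
        # Combining two values yields the union of their labels.
        if isinstance(other, Labeled):
            return Labeled(self.value + other.value, self.labels | other.labels)
        return Labeled(self.value + other, self.labels)

    def __radd__(self, other):
        # A plain value on the left-hand side contributes no labels.
        return Labeled(other + self.value, self.labels)


def read_location():
    # Taint source: sensitive data is marked where it enters the system.
    return Labeled("45.38,-75.69", {"LOCATION"})


alerts = []


def network_send(destination, data):
    # Taint sink: record an alert if labeled data is about to leave.
    if isinstance(data, Labeled) and data.labels:
        alerts.append((destination, sorted(data.labels)))


request = "GET /ads?loc=" + read_location()
network_send("ads.example.com", request)
network_send("ads.example.com", "GET /ping")
```

Only the first call records an alert: the location was combined with other text before being sent, but the label followed it to the sink.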

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes only roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)
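
One compact way to hold 32 independent markings per data unit is a bit vector, sketched below in Python. This assumes each marking occupies one bit of a 32-bit integer; the constant names and their bit positions are hypothetical and chosen only for the example.

```python
# Hypothetical marking assignments: one bit per kind of sensitive data.
TAINT_LOCATION = 1 << 0
TAINT_CONTACTS = 1 << 1
TAINT_DEVICE_ID = 1 << 2

MASK_32 = 0xFFFFFFFF  # keep tags within 32 bits


def merge(tag_a, tag_b):
    """Combine two tags: the result carries every marking of either input."""
    return (tag_a | tag_b) & MASK_32


def markings(tag):
    """List the bit positions (0..31) that are set in a tag."""
    return [bit for bit in range(32) if tag & (1 << bit)]


tag = merge(TAINT_LOCATION, TAINT_DEVICE_ID)
```

Merging is a single bitwise OR in this representation, which is part of why per-variable tags of this kind can be cheap to maintain.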

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2 to 20 times slower than normal, which is not suitable for realtime analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG or REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[REF] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall 14% overhead) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the phone is in use. Compare this with the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, 'TaintCheck' performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the 'application binary', and TaintDroid then reports whether the application is safe or not. Such an analysis would only need to run once, before installation, so the overhead would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of 'malicious applications'.

=Questions=
Possible exam questions and brief answers are listed below, along with the section where more information on each question can be found.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009).

[] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005).

[] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

[] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04.

[] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:08:30Z

Sliske: /* Static Taint Analysis */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form their basis have to be understood. The following two concepts are central to understanding this paper:
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Depending on the path the program takes, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, due to the indirect nature of implicit flows.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because static analysis does not have access to any run-time information about the program. [REF] 

===Mathematical Model===
<big><code>For all variables V = {T,U} ;T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] 

''Note: The paper talks about Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" by tainting variables that are of value and tracking their progress. In the usual context of Taint Analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data. (Noted here in case the terminology causes confusion.) ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if we ever see marked data leave through the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes only roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the virtual machine layer of the Android system (the Dalvik VM, referred to here as the JVM layer) to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, message-level tracking monitors inter-process (that is, inter-application) communication, and method-level tracking of native code invoked through the Java Native Interface (JNI) allows them to "patch the taint propagation on return" (Enck et al., YEAR, p3), so they can keep track of information transferred via native code. [REF] Finally, by modifying the network and secondary storage interfaces they provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavyweight, whole-system emulation [REF], TaintDroid uses the virtualized architecture of Android to combine four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics chosen strongly influence both the performance and the accuracy of TaintDroid. Existing taint tracking approaches, such as Panorama, rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system run from 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem known as taint explosion: with complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels, which helps ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were judged to be clearly legitimate uses, leaving the remainder as potential privacy risks.[REF] The study also determined that 50% of the applications sent the user's location to advertising servers, and that seven of the applications transmitted the device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer-grained visibility into how applications use private data is needed, and TaintDroid is a step in the right direction by providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To avoid additional overhead, TaintDroid does not track implicit flow, that is, flow through the program's control structure. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could otherwise use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: only one taint tag is stored per array, and therefore per string object, so every element shares the same tag. Because of this coarse tagging, it is possible for false positives to occur.
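The bypass described above can be illustrated with a toy tracker that, like TaintDroid, propagates taint only along explicit data flow. This is a sketch for illustration only; the <code>Tainted</code> class and both copy functions are hypothetical and do not reflect how TaintDroid is implemented.

```python
class Tainted:
    """A value paired with a boolean taint flag (hypothetical helper)."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def __add__(self, other):
        # Explicit flow: the result is tainted if either operand is tainted.
        if isinstance(other, Tainted):
            return Tainted(self.value + other.value, self.tainted or other.tainted)
        return Tainted(self.value + other, self.tainted)


def copy_explicit(secret):
    """Copy through a data dependency; the taint flag follows the value."""
    return secret + 0


def copy_implicit(secret, bits=8):
    """Copy one bit at a time through branches; no tainted operand is ever assigned."""
    leaked = Tainted(0)
    for i in range(bits):
        if (secret.value >> i) & 1:     # the branch condition depends on the secret
            leaked = leaked + (1 << i)  # but only an untainted constant is added
    return leaked


secret = Tainted(173, tainted=True)
print(copy_explicit(secret).tainted)  # True: the tracker sees this copy
leak = copy_implicit(secret)
print(leak.value, leak.tainted)       # 173 False: same value, taint lost
```

The second copy reproduces the secret exactly while carrying no taint, which is why a tracker limited to explicit flow cannot flag it.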

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall CPU overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Compare this with the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary and from which TaintDroid returns a verdict on whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to consult for more information on each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects, as TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid has been implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[]D CEARA, ML POTET et.al [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis] GINP ENSIMAG GoogleCode(2009)

[] NEWSOME,J.,AND SONG,D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software] Proceedings of the Network and Distributed System Security Symposium (NDSS 2005)

[]YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

[] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., ROSENBLUM, M., [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation] USENIX Security '04

[]HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[]SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[]NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:07:42Z

Sliske: /* Taint Analysis */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding the paper.
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure' which is PRIVATE is transferred to 'notsecure' which is PUBLIC which is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Depending on the flow of the program, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, owing to their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf] 

===Dynamic Taint Analysis===
Taint analysis done at run time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec] 

===Static Taint Analysis===
Static taint analysis is a technique for computing an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [REF] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because a static analysis does not have access to any run-time information about the program. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf] 

===Mathematical Model===
<big><code>For all variables V = {T,U} ;T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] 
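The ⊕ operator defined above can be written out directly. This is a minimal sketch that follows the two rules of the model; the function names <code>oplus</code> and <code>taint_of</code> are illustrative choices, not notation from the paper.

```python
from functools import reduce

# The two elements of V from the model above.
T, U = "T", "U"


def oplus(x, y):
    """x (+) y is T if either operand is T, and U only if both are U."""
    return T if x == T or y == T else U


def taint_of(*operands):
    """Taint of a result computed from several operands (U if there are none)."""
    return reduce(oplus, operands, U)


print(oplus(T, U))        # T
print(oplus(U, U))        # U
print(taint_of(U, U, T))  # T
```

Since ⊕ behaves like a logical OR over the set {T, U}, folding it over any number of operands gives the taint of an expression built from them.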

''Note: The paper is about dynamic taint analysis. TaintDroid makes ingenious use of "taint" to mark variables that hold valuable data and to track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the tainted variables are in fact important private data (in case this caused any confusion). ''

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving through the network interface of the phone, then we know that some sensitive information has been disseminated.
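The source-to-sink idea in the paragraph above can be sketched in a few lines: data is labelled where it is read, the label follows the data through operations, and the network interface checks for labels before sending. Everything here is a hypothetical illustration; <code>Labeled</code>, <code>source_gps</code>, <code>concat</code>, <code>network_send</code> and the host name are assumptions, not TaintDroid's API.

```python
class Labeled:
    """A string plus the set of sensitive sources it was derived from."""

    def __init__(self, data, labels=()):
        self.data = data
        self.labels = frozenset(labels)


def source_gps():
    """Taint source (hypothetical): a location reading, marked at its origin."""
    return Labeled("45.38,-75.69", {"location"})


def concat(*parts):
    """Propagation: the result carries the labels of every part."""
    data = "".join(p.data for p in parts)
    labels = frozenset().union(*(p.labels for p in parts))
    return Labeled(data, labels)


alerts = []


def network_send(host, payload):
    """Taint sink: record an alert if labelled data is about to leave the phone."""
    if payload.labels:
        alerts.append((host, sorted(payload.labels)))
    return len(payload.data)  # stand-in for the number of bytes sent


message = concat(Labeled("loc="), source_gps(), Labeled("&v=1"))
network_send("ads.example.com", message)
print(alerts)  # [('ads.example.com', ['location'])]
```

Note that the location is combined with other data before it is sent, yet the label survives the concatenation, which is the property the tracking system needs.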

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid incurs roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the virtual machine layer of the Android system (the Dalvik VM, referred to here as the JVM layer) to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, message-level tracking monitors inter-process (that is, inter-application) communication, and method-level tracking of native code invoked through the Java Native Interface (JNI) allows them to "patch the taint propagation on return" (Enck et al., YEAR, p3), so they can keep track of information transferred via native code. [REF] Finally, by modifying the network and secondary storage interfaces they provide file-level taint tracking, which ensures that "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavyweight, whole-system emulation [REF], TaintDroid uses the virtualized architecture of Android to combine four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics chosen strongly influence both the performance and the accuracy of TaintDroid. Existing taint tracking approaches, such as Panorama, rely on instruction-level dynamic taint analysis using whole-system emulation. This can make the system run from 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem known as taint explosion: with complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels, which helps ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were judged to be clearly legitimate uses, leaving the remainder as potential privacy risks.[REF] The study also determined that 50% of the applications sent the user's location to advertising servers, and that seven of the applications transmitted the device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer-grained visibility into how applications use private data is needed, and TaintDroid is a step in the right direction by providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To avoid additional overhead, TaintDroid does not track implicit flow, that is, flow through the program's control structure. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could otherwise use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: only one taint tag is stored per array, and therefore per string object, so every element shares the same tag. Because of this coarse tagging, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability for the average user. Being a firmware modification drastically reduces its usability unless Android itself incorporates these changes, which is highly unlikely because the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall CPU overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Compare this with the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary and from which TaintDroid returns a verdict on whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to consult for more information on each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects, as TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid has been implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[]D CEARA, ML POTET et.al [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis] GINP ENSIMAG GoogleCode(2009)

[] NEWSOME,J.,AND SONG,D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software] Proceedings of the Network and Distributed System Security Symposium (NDSS 2005)

[]YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

[] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., ROSENBLUM, M., [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation] USENIX Security '04

[]HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[]SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[]NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:06:28Z

Sliske: /* Research problem */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding the paper.
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' from the outcome of the conditional, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
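
The difference between the two kinds of flow can be sketched in Python. This is only an illustration under stated assumptions: the <code>Tainted</code> class and both helper functions are hypothetical names invented here, and the tracker is deliberately naive, following assignments only. It shows why such a tracker sees the explicit copy but misses the implicit one.

```python
class Tainted:
    """A value paired with a taint flag (hypothetical helper, not from the paper)."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted


def explicit_copy(src):
    # Explicit flow: the value is assigned directly, so the flag travels with it.
    return Tainted(src.value, src.tainted)


def implicit_copy(src, guess):
    # Implicit flow: the branch depends on the secret, but the result is built
    # from constants, so an assignment-only tracker records no taint.
    if src.value == guess:
        return Tainted(1)
    return Tainted(0)


secure = Tainted("blah blah", tainted=True)
leak_explicit = explicit_copy(secure)
leak_implicit = implicit_copy(secure, "blah blah")
```

Here <code>leak_implicit.value</code> reveals whether the guess was right, yet its flag is untainted, which is the bypass the pseudocode above describes.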

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec] 

===Static Taint Analysis===
Static taint analysis is a technique that computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source of the program. [REF] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because the static analysis does not have access to any additional run-time information about the program. [REF] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint labels, where T is tainted and U is untainted. 
Define <big>⊕</big>: V x V -> V; for x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] 
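
The ⊕ operator from the model above can be written out as a small Python sketch. The names <code>Taint</code> and <code>combine</code> are chosen here for illustration and do not come from the paper; the function simply encodes the two rules of the model.

```python
from enum import Enum


class Taint(Enum):
    U = 0  # untainted
    T = 1  # tainted


def combine(x, y):
    # x ⊕ y is T if either operand is T, and U only if both are U.
    return Taint.T if Taint.T in (x, y) else Taint.U


# Enumerate the full truth table of the operator.
table = {(x, y): combine(x, y) for x in Taint for y in Taint}
```

Only the pair (U, U) maps to U, so once a tainted value takes part in an operation the result stays tainted.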

Note: The paper talks about Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" to mark variables that are of value and track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the tainted variables are in fact important private data. For more detailed information on taint analysis, refer to "Detecting Software Vulnerabilities Static Taint Analysis"[2]

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if we ever see marked data leave the network interface of the phone, then we know that some sensitive information has been disseminated.
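
The mark-at-source, check-at-sink idea can be sketched as follows. This is a toy model under stated assumptions, not TaintDroid's implementation: <code>Labeled</code>, <code>read_location</code>, <code>network_send</code>, the sample coordinates and the host name are all hypothetical, and taint is carried only through string concatenation.

```python
class Labeled:
    """A string value carrying a set of taint labels (hypothetical helper)."""

    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)

    def __add__(self, other):
        # Concatenation propagates the union of both operands' labels.
        if isinstance(other, Labeled):
            return Labeled(self.value + other.value, self.labels | other.labels)
        return Labeled(self.value + other, self.labels)


def read_location():
    # Taint source: sensitive data is marked where it enters the program.
    return Labeled("45.38,-75.69", {"LOCATION"})


leak_log = []


def network_send(host, payload):
    # Taint sink: any labels still attached at the network boundary are logged.
    if isinstance(payload, Labeled) and payload.labels:
        leak_log.append((host, sorted(payload.labels)))


message = Labeled("loc=") + read_location()
network_send("ads.example.com", message)
network_send("ads.example.com", Labeled("hello"))
```

Only the first call is logged, because the second payload never touched marked data.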

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1353&rep=rep1&type=pdf]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, constrained hardware devices with minimal overhead. TaintDroid causes only roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead is only with regard to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)
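
One compact way to hold 32 taint markings per data unit is a 32-bit vector with one bit per marking, so that combining two tags is a single bitwise OR. The sketch below assumes that encoding; the specific marking names and bit positions are invented here for illustration.

```python
# Hypothetical marking assignments: one bit per kind of sensitive data.
LOCATION = 1 << 0
PHONE_NUMBER = 1 << 1
DEVICE_ID = 1 << 2

MASK32 = 0xFFFFFFFF  # keep tags within 32 bits


def merge_tags(a, b):
    # The tag of a derived value is the union of its operands' markings.
    return (a | b) & MASK32


def has_marking(tag, marking):
    # Test whether a particular marking is present in the tag.
    return (tag & marking) != 0


combined = merge_tags(LOCATION, DEVICE_ID)
```

A fixed-width tag keeps the per-variable storage cost constant no matter how many markings are set.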

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).[REF]
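
File-level tracking can be pictured as one tag per file that only ever accumulates markings. The sketch below is a simplified model under that assumption; the in-memory <code>file_tags</code> dictionary, the function names and the example path are hypothetical and do not reflect how TaintDroid actually stores tags.

```python
file_tags = {}  # path -> integer taint tag (hypothetical in-memory store)


def write_file(path, data, tag):
    # Conservative rule: writing tainted data adds its markings to the
    # whole file; existing markings are never cleared by a later write.
    file_tags[path] = file_tags.get(path, 0) | tag


def read_file_tag(path):
    # Anything read from the file inherits the file's accumulated tag.
    return file_tags.get(path, 0)


write_file("/data/cache.txt", "45.38,-75.69", 0b001)
write_file("/data/cache.txt", "harmless text", 0b000)
tag_after = read_file_tag("/data/cache.txt")
```

The second, untainted write does not clear the marking, which is the conservative behaviour: it may produce false positives but does not lose taint.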

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2 to 20 times slower than normal, which is not suitable for realtime analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With some complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[REF] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, this level of granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in Taint Tag Storage, which are due to the fact that most string objects have the same tag. Because of the similar tags, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead as the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, 'TaintCheck' is a mechanism that performs dynamic taint analysis by binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can then envision a service to which you upload the application binary, and TaintDroid returns a verdict on whether the application is safe or not. This has the advantage of needing to run just once before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leak, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects, as TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid has been implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[]CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis] GINP ENSIMAG GoogleCode(2009)

[] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software] Proceedings of the Network and Distributed System Security Symposium (NDSS 2005)

[]YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

[] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., ROSENBLUM, M., [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation] USENIX Security '04

[]HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[]SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[]NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement.] International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:03:36Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow this paper, the concepts that form the basis of its theory have to be understood. The following two concepts are central to understanding it: 
==Information Flow==
Information flow as the name suggests is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Information Flow Theory tries to quantify this flow of information into a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' from the outcome of the conditional, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec] 

===Static Taint Analysis===
Static taint analysis is a technique that computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source of the program. [REF] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because the static analysis does not have access to any additional run-time information about the program. [REF] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint labels, where T is tainted and U is untainted. 
Define <big>⊕</big>: V x V -> V; for x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] 

Note: The paper talks about Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" to mark variables that are of value and track their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the tainted variables are in fact important private data. For more detailed information on taint analysis, refer to "Detecting Software Vulnerabilities Static Taint Analysis"[2]

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if we ever see marked data leave the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.[REF]
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[REF]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, constrained hardware devices with minimal overhead. TaintDroid causes only roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead is only with regard to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2 to 20 times slower than normal, which is not suitable for realtime analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With some complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing distinction between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 35 were legitimate risks.[REF] It also determined that 50% of the applications submitted the user's location to advertising servers and 5 of the applications transmitted the user's device ID, phone number and SIM card serial number. Clearly, this level of granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To avoid additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could otherwise use to track implicit flow dynamically. It is presumed that branch structures are maintained at the kernel level; a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage, which arise because most string objects have the same tag. Because of the similar tags, false positives can occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall 14% overhead) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All that TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, 'TaintCheck' performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the 'application binary', after which TaintDroid returns a verdict on whether the application is safe. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of 'malicious applications'.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects, as TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid has been implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.

[]D CEARA, ML POTET et.al [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis] GINP ENSIMAG GoogleCode(2009)

[] NEWSOME,J.,AND SONG,D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software] Proceedings of the Network and Distributed System Security Symposium (NDSS 2005)

[]YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).''

[] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., ROSENBLUM, M., [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation] USENIX Security '04

[]HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[]SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us

[]NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement.] International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:03:05Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can take place between two processes or within a given process, for example between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Information Flow Theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure == "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
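The difference between the two kinds of flow can be sketched in Python. This is only an illustration under stated assumptions: the <code>Tainted</code> class and both copy functions are hypothetical helpers written for this example, and the tracker follows data flow only, as a simple explicit-flow tracker would.

```python
class Tainted:
    """A value paired with a taint flag (hypothetical helper for illustration)."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def plus(self, other):
        # Explicit flow: the result is computed from both operands,
        # so it inherits the taint of either one.
        return Tainted(self.value + other.value, self.tainted or other.tainted)


def copy_explicit(secure):
    # notsecure = secure + 0: the taint travels with the data.
    return secure.plus(Tainted(0))


def copy_implicit(secure):
    # The branch condition reads the secret, but each assignment uses only
    # a constant, so a tracker that follows data flow alone sees no taint.
    if secure.value == 1:
        notsecure = Tainted(1)
    else:
        notsecure = Tainted(0)
    return notsecure


secure = Tainted(1, tainted=True)
explicit_copy = copy_explicit(secure)
implicit_copy = copy_implicit(secure)
```

Both copies hold the secret value, but only the explicit copy still carries the taint flag; the implicit copy has slipped past the tracker.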

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec] 

===Static Taint Analysis===
Static taint analysis is a technique that computes an over-approximation of the set of instructions influenced by user input. The set of tainted instructions is computed statically by analyzing the source of the program. [REF] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because the static analysis does not have access to any additional run-time information about the program. [REF] 

===Mathematical Model===
<big><code>For all variables V = {T,U} ;T are tainted and U are untainted: 
Using <big>⊕</big>: V x V -> V, x, y ∈ V 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] 
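The ⊕ rule above can be sketched in Python as follows. The names <code>Taint</code>, <code>combine</code> and <code>propagate</code>, and the toy program, are assumptions made for this illustration; they are not taken from the paper.

```python
from enum import Enum


class Taint(Enum):
    U = "untainted"
    T = "tainted"


def combine(x: Taint, y: Taint) -> Taint:
    """The ⊕ operator: T if either operand is T, otherwise U."""
    return Taint.T if Taint.T in (x, y) else Taint.U


def propagate(assignments, initial):
    """Apply assignments of the form (target, [sources]) in order.

    A constant is written as an empty source list and is untainted.
    """
    state = dict(initial)
    for target, sources in assignments:
        label = Taint.U
        for source in sources:
            label = combine(label, state.get(source, Taint.U))
        state[target] = label
    return state


program = [
    ("b", ["a"]),       # b = a
    ("c", ["b", "k"]),  # c = b + k
    ("d", ["k"]),       # d = k
    ("b", []),          # b = 0, overwritten with a constant
]
labels = propagate(program, {"a": Taint.T, "k": Taint.U})
```

Here <code>c</code> ends up tainted because it was computed from <code>b</code> while <code>b</code> still held tainted data, whereas <code>b</code> itself becomes untainted again once it is overwritten with a constant.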

Note: The paper is about Dynamic Taint Analysis. TaintDroid makes ingenious use of "taint" by marking variables that hold valuable data and tracking their progress. In the usual context of Taint Analysis, "taint" marks untrusted information; in this case, however, the "tainted" variables are in fact important private data. For more detailed information on Taint Analysis, refer to "Detecting Software Vulnerabilities Static Taint Analysis"[2]

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if marked data is ever seen leaving the network interface of the phone, then we know that some sensitive information has been disseminated.
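The mark-at-source, check-at-sink idea can be sketched in a few lines of Python. Everything here is hypothetical and only for illustration: the <code>Marked</code> wrapper, the <code>read_location</code> source, the <code>NetworkSink</code> class, the host names and the coordinates are not TaintDroid's actual interfaces.

```python
class Marked:
    """Data plus the set of source labels it was derived from."""

    def __init__(self, data, labels=frozenset()):
        self.data = data
        self.labels = frozenset(labels)

    def combine(self, other):
        # Derived data carries the union of the labels of its inputs.
        return Marked(self.data + other.data, self.labels | other.labels)


def read_location():
    # Taint source: sensitive data is marked the moment it is read.
    return Marked("45.38,-75.69", {"location"})


class NetworkSink:
    """Taint sink: records an alert when marked data is about to leave."""

    def __init__(self):
        self.alerts = []

    def send(self, host, payload):
        if payload.labels:
            self.alerts.append((host, sorted(payload.labels)))
        return len(payload.data)


net = NetworkSink()
query = Marked("q=").combine(read_location())
net.send("ads.example.com", query)
net.send("example.com", Marked("ping"))
```

Only the first transmission raises an alert, because only that payload was derived from the marked location data.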

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smart phones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must be considered "usable" by the end user, and so the system must be truly light weight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.[REF]
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[REF]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained hardware with minimal overhead. TaintDroid causes only roughly a 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)
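One way to see why 32 markings per data unit can be cheap is to store them as one bit each in a 32-bit integer, so that merging two tags is a single bitwise OR. The sketch below assumes that encoding; the marking names and helper functions are hypothetical and chosen only for illustration.

```python
# Hypothetical marking names; each one occupies a single bit of the tag.
MARKINGS = ["location", "device_id", "phone_number", "sim_serial"]
BIT = {name: 1 << index for index, name in enumerate(MARKINGS)}

MAX_MARKINGS = 32
TAG_MASK = (1 << MAX_MARKINGS) - 1  # keep every tag within 32 bits


def make_tag(*names):
    """Build a tag with the bits for the given markings set."""
    tag = 0
    for name in names:
        tag |= BIT[name]
    return tag & TAG_MASK


def union(a, b):
    """Merge two tags; the result carries every marking of both inputs."""
    return (a | b) & TAG_MASK


def decode(tag):
    """List the marking names present in a tag."""
    return [name for name in MARKINGS if tag & BIT[name]]


location_tag = make_tag("location")
identity_tag = make_tag("device_id", "sim_serial")
combined_tag = union(location_tag, identity_tag)
```

Decoding <code>combined_tag</code> yields the location, device ID and SIM serial markings, all stored in one machine word.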

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the Java Native Interface (JNI) to provide message-level tracking which allows them to monitor inter-process, a.k.a. inter-application, communications. This also allows them to "patch the taint propagation on return." (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces they are able to provide file-level taint tracking which enables them to ensure "persistent information conservatively retains its taint markings." (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system run 2 to 20 times slower than normal, which is unsuitable for realtime analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, variable-level tracking allows TaintDroid to provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were clearly legitimate uses of the data.[REF] It also determined that 50% of the applications submitted the user's location to advertising servers, and that seven of the applications transmitted the user's device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer-grained visibility is needed, and TaintDroid is a step in the right direction, providing a highly efficient realtime tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To avoid additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid could otherwise use to track implicit flow dynamically. It is presumed that branch structures are maintained at the kernel level; a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage, which arise because most string objects have the same tag. Because of the similar tags, false positives can occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall 14% overhead) are on the higher side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All that TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In essence, 'TaintCheck' performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the 'application binary', after which TaintDroid returns a verdict on whether the application is safe. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of 'malicious applications'.

=Questions=
Possible exam questions and brief answers are listed below, along with the section to go to to find more information pertaining to that question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects, as TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid has been implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243. 
[]D CEARA, ML POTET et.al [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis] GINP ENSIMAG GoogleCode(2009) 
[] NEWSOME,J.,AND SONG,D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software] Proceedings of the Network and Distributed System Security Symposium (NDSS 2005) 

[]YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing system-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).'' 
[] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., ROSENBLUM, M., [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation] USENIX Security '04 
[]HALDAR, V., CHANDRA, D., FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.
[]SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs).] http://pskl.us 
[]NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement.] International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007)

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:02:41Z

Sliske: /* References */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can take place between two processes or within a given process, for example between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Information Flow Theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. Here the leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Depending on the path the program takes, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure == "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [REF] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because static analysis does not have access to any run-time information about the program. [REF] 
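As a rough sketch of what "over-approximation" means here, the following Python function treats a program as a flat list of assignments gathered from every branch and iterates to a fixed point. The program representation and all names are assumptions chosen for this example; real static analyzers work on far richer program models.

```python
def static_taint(assignments, sources):
    """Return every variable that may be influenced by a source.

    assignments: list of (dest, operands) pairs collected from all
    branches of the program, whether or not a given run executes them.
    """
    tainted = set(sources)
    changed = True
    while changed:               # iterate until nothing new is added
        changed = False
        for dest, operands in assignments:
            if dest not in tainted and tainted.intersection(operands):
                tainted.add(dest)
                changed = True
    return tainted


program = [
    ("a", ["user_input"]),   # then-branch
    ("a", ["constant"]),     # else-branch, overwrites a with clean data
    ("b", ["a"]),
    ("c", ["constant"]),
]
result = static_taint(program, {"user_input"})
```

A run that happens to take the else-branch would leave <code>a</code> and <code>b</code> clean, but the static result still reports them, because every path is considered and no run-time information is available.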

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint values, where T is tainted and U is untainted. 
Define <big>⊕</big>: V x V -> V, for x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] 
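The model above translates almost directly into code. The sketch below encodes T and U as strings and ⊕ as a function; <code>combine</code> and <code>combine_all</code> are names chosen for this example only.

```python
from functools import reduce
from itertools import product

T, U = "T", "U"


def combine(x, y):
    """x ⊕ y: the result is tainted if either operand is tainted."""
    return T if T in (x, y) else U


def combine_all(values):
    """Fold ⊕ over any number of operands; no operands means untainted."""
    return reduce(combine, values, U)


# Enumerate all four cases of the operator.
truth_table = {(x, y): combine(x, y) for x, y in product((T, U), repeat=2)}
```

Because U acts as the identity element of ⊕, a single tainted operand anywhere in an expression is enough to taint the result.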

Note: The paper talks about dynamic taint analysis. TaintDroid makes ingenious use of "taint" by tainting variables that are of value and tracking their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the tainted variables are in fact important private data. For more detailed information on taint analysis, refer to "Detecting Software Vulnerabilities Static Taint Analysis".[2]

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if we ever see marked data leave the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.[REF]
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[REF]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid causes roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead was measured on a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)
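One compact way to hold 32 independent markings per data unit is a 32-bit bit vector, with one bit per kind of sensitive data. The sketch below assumes that encoding; the marking names and bit positions are hypothetical and are not taken from the TaintDroid source.

```python
MASK_32 = 0xFFFFFFFF

# Hypothetical bit assignments, one bit per kind of sensitive data.
TAINT_CLEAR = 0
TAINT_LOCATION = 1 << 0
TAINT_PHONE_NUMBER = 1 << 1
TAINT_DEVICE_ID = 1 << 2

NAMES = {
    TAINT_LOCATION: "location",
    TAINT_PHONE_NUMBER: "phone_number",
    TAINT_DEVICE_ID: "device_id",
}


def merge(*tags):
    """Union of markings: bitwise OR, kept within 32 bits."""
    merged = TAINT_CLEAR
    for tag in tags:
        merged |= tag
    return merged & MASK_32


def describe(tag):
    """List the names of the markings set in a tag."""
    return [name for bit, name in NAMES.items() if tag & bit]


combined = merge(TAINT_LOCATION, TAINT_DEVICE_ID)
```

With this encoding, combining two tags is a single OR instruction and the storage cost is one machine word per tracked unit, which is consistent with the low overheads reported above.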

This low overhead is achieved by modifying the code directly at the Java virtual machine layer of the Android system (the Dalvik VM interpreter) to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, message-level tracking allows them to monitor inter-process, a.k.a. inter-application, communications. For system-provided native libraries, which are reached through the Java Native Interface (JNI), they use method-level tracking: the native code runs without instrumentation and they "patch the taint propagation on return" (Enck et al., YEAR, p3), so they can keep track of information transfer via native code. [REF] Finally, by modifying the secondary storage interfaces they provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3), and by modifying the network interface they can detect tainted data as it leaves the phone.[REF]
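Message-level tracking can be pictured as keeping one tag per message instead of one tag per variable. The following sketch assumes that the message tag is the union of the tags of everything packed into the message, and that the receiver applies that one tag to every field it unpacks; the function names and the dictionary-based message format are invented for illustration.

```python
def send_message(fields):
    """Pack a message.

    fields: dict mapping a field name to a (value, tag) pair.
    Returns the untagged payload and a single tag for the whole message.
    """
    payload = {name: value for name, (value, _tag) in fields.items()}
    message_tag = 0
    for _value, tag in fields.values():
        message_tag |= tag          # union of all field tags
    return payload, message_tag


def receive_message(payload, message_tag):
    """Unpack a message: every field inherits the message-wide tag."""
    return {name: (value, message_tag) for name, value in payload.items()}


LOCATION = 0b01   # hypothetical marking
outgoing = {"greeting": ("hello", 0), "lat": (45.38, LOCATION)}
payload, tag = send_message(outgoing)
incoming = receive_message(payload, tag)
```

The harmless <code>greeting</code> field arrives carrying the location marking. That is the trade-off of coarser granularity: less bookkeeping per message, at the price of possible false positives.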

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavyweight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as Panorama, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With some complex instructions, such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were judged to be clearly legitimate uses, leaving 68 instances of potential misuse.[REF] It also determined that 50% of the applications sent the user's location to advertising servers, and that 7 of the applications collected the device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer-grained control over private data is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow, that is, flow through control structures. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the bytecode run by Android's virtual machine does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: TaintDroid stores only one taint tag per array, so all elements of an array, including the characters of a string, share the same tag. Because of this shared tag, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (in this case a memory overhead of 4.4%, an IPC overhead of 27% and an overall 14% overhead) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead as the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All that TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. What this essentially means is that TaintCheck is a mechanism that can perform dynamic taint analysis by performing binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can then envision a service to which you upload the application binary and from which TaintDroid returns a result saying whether the application is safe or not. This has the advantage of needing to run just once, before installation, and hence the overheads would not be much of a concern. This could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section that contains more information pertaining to each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the virtual machine used by Android to run user applications written in Java. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects, as TaintDroid has no way to inspect implicit flow dynamically, due to no branch structures being maintained at the JVM layer, where TaintDroid has been implemented. Instead, control structures would be part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[1] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243. 

[2] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009). 

[3] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005). 

[4] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis]. In ''Proceedings of ACM Computer and Communications Security (2007).'' 

[5] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04. 

[6] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.

[7] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs)]. http://pskl.us 

[8] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:00:41Z

Sliske: /* Critique */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts which form the basis of this theory have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer of information can be between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Information flow theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Depending on the path the program takes, the secure information can be compromised, as shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure == "blah blah" then: 
notsecure = 1 
else: 
notsecure = 0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called dynamic taint analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the program executes more slowly because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec] 

===Static Taint Analysis===
Static taint analysis computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [REF] The main advantage of static taint analysis is that it takes into account all possible execution paths of the program. On the other hand, the analysis may not be as accurate as a dynamic analysis, because static analysis does not have access to any run-time information about the program. [REF] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of taint values, where T is tainted and U is untainted. 
Define <big>⊕</big>: V x V -> V, for x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] 

Note: The paper talks about dynamic taint analysis. TaintDroid makes ingenious use of "taint" by tainting variables that are of value and tracking their progress. In the usual context of taint analysis, "taint" marks untrusted information; in this case, however, the tainted variables are in fact important private data. For more detailed information on taint analysis, refer to "Detecting Software Vulnerabilities Static Taint Analysis".[2]

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called dynamic taint analysis, sometimes called taint tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through the system. In the context of this paper, if we ever see marked data leave the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must remain "usable" for the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.[REF]
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[REF]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid causes roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead was measured on a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)

This low overhead is achieved by modifying the code directly at the Java virtual machine layer of the Android system (the Dalvik VM interpreter) to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, message-level tracking allows them to monitor inter-process, a.k.a. inter-application, communications. For system-provided native libraries, which are reached through the Java Native Interface (JNI), they use method-level tracking: the native code runs without instrumentation and they "patch the taint propagation on return" (Enck et al., YEAR, p3), so they can keep track of information transfer via native code. [REF] Finally, by modifying the secondary storage interfaces they provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3), and by modifying the network interface they can detect tainted data as it leaves the phone.[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavyweight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, such as Panorama, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With some complex instructions, such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, allowing it to distinguish between different data pointers at different levels to ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were judged to be clearly legitimate uses, leaving 68 instances of potential misuse.[REF] It also determined that 50% of the applications sent the user's location to advertising servers, and that 7 of the applications collected the device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer-grained control over private data is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

Challenges of monitoring network disclosure of privacy sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to find a way around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, due to the impressive number of information leaks that were found. TaintDroid effectively identifies information misuse at a high percentage. However, while the implementation is strong in that the overhead is so low and accuracy is high, there are trade-offs that were incurred to meet that overhead.

To prevent additional overhead, TaintDroid does not track implicit data flow, that is, flow through control structures. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the bytecode run by Android's virtual machine does not maintain branch structures, which TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at a kernel level, as a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage: TaintDroid stores only one taint tag per array, so all elements of an array, including the characters of a string, share the same tag. Because of this shared tag, it is possible for false positives to occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely because the overheads (in this case a memory overhead of 4.4%, an IPC overhead of 27% and an overall 14% overhead) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs an additional overhead as the user uses the phone. Consider the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All that TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. What this essentially means is that TaintCheck is a mechanism that can perform dynamic taint analysis by performing binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can then envision a service to which you upload the application binary and from which TaintDroid returns a result saying whether the application is safe or not. This has the advantage of needing to run just once, before installation, and hence the overheads would not be much of a concern. This could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section that contains more information pertaining to each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures are part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243. 
[] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009). 
[] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005). 

[] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04. 
[] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.
[] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs)]. http://pskl.us 
[] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:00:33Z

Sliske: /* Critique */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can take place between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Information Flow Theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.
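
To make the difference concrete, the following is a minimal Python sketch of a tracker that propagates taint only through data operations. The <code>Tainted</code> class and both copy functions are hypothetical helpers written for this illustration; they are not taken from the paper. The explicit copy keeps its taint, while the implicit copy reproduces the same secret value without ever being marked.

```python
class Tainted:
    """A value paired with a taint flag (hypothetical helper, not TaintDroid code)."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def __or__(self, other):
        # Data-flow rule: the result is tainted if either operand is tainted.
        return Tainted(self.value | other.value, self.tainted or other.tainted)


def copy_explicit(src):
    # Explicit flow: the secret is an operand, so its taint propagates.
    return src | Tainted(0)


def copy_implicit(src, bits=8):
    # Implicit flow: the secret only steers the branches. Every operand that
    # reaches `result` is an untainted constant, so no taint propagates.
    result = Tainted(0)
    for bit in range(bits):
        if src.value & (1 << bit):
            result = result | Tainted(1 << bit)
    return result


secret = Tainted(0b10110010, tainted=True)
explicit = copy_explicit(secret)
implicit = copy_implicit(secret)
print(explicit.value == secret.value, explicit.tainted)
print(implicit.value == secret.value, implicit.tainted)
```

Both copies hold the secret value, but only the explicit one is still flagged, which is why a tracker that follows data flow alone can be bypassed through control flow.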

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec] 

===Static Taint Analysis===
Static taint analysis is a technique that computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [REF] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because it does not have access to any run-time information about the program. [REF] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of labels, where T is tainted and U is untainted. 
Define <big>⊕</big>: V x V -> V such that, for x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] 
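
The propagation rule above can be sketched in a few lines of Python. The names <code>Label</code>, <code>join</code> and <code>label_of</code> are chosen here for illustration and do not come from the cited paper; the sketch only assumes the two-element model with T (tainted) and U (untainted).

```python
from enum import Enum
from functools import reduce


class Label(Enum):
    U = 0  # untainted
    T = 1  # tainted


def join(x, y):
    # x (+) y is T if either operand is T, and U only if both are U.
    return Label.T if Label.T in (x, y) else Label.U


def label_of(*operands):
    # The label of a computed value is the join of all operand labels.
    # With no operands (a constant), the result is untainted.
    return reduce(join, operands, Label.U)


print(join(Label.U, Label.U))
print(join(Label.T, Label.U))
print(label_of(Label.U, Label.U, Label.T))
```

Because the join is associative and commutative, the order in which operands are combined does not matter, so a single tainted operand anywhere in an expression taints the result.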

Note: The paper uses Dynamic Taint Analysis in an unusual way. In conventional taint analysis, "taint" marks untrusted information; TaintDroid instead taints variables that hold valuable private data and tracks their progress through the system. For more detailed information on taint analysis, refer to "Detecting Software Vulnerabilities Static Taint Analysis"[2].

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if we ever see marked data leave the network interface of the phone, then we know that some sensitive information has been disseminated.
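
The source-to-sink idea can be sketched as a toy Python model. Everything here (the <code>Data</code> class, the source names, <code>network_send</code> and the example host name) is hypothetical and only illustrates the principle; it is not how TaintDroid itself is implemented.

```python
SENSITIVE_SOURCES = {"gps", "phone_number"}  # hypothetical source names

leak_log = []  # records (destination, markings) for every tainted send


class Data:
    """A payload together with the set of taint markings it carries."""

    def __init__(self, payload, tags=frozenset()):
        self.payload = payload
        self.tags = frozenset(tags)


def read_source(name, payload):
    # Taint source: data read from a sensitive source is marked at the origin.
    tags = {name} if name in SENSITIVE_SOURCES else set()
    return Data(payload, tags)


def concat(a, b):
    # Propagation: derived data carries the union of its inputs' markings.
    return Data(a.payload + b.payload, a.tags | b.tags)


def network_send(data, destination):
    # Taint sink: marked data leaving over the network is logged.
    if data.tags:
        leak_log.append((destination, sorted(data.tags)))
    return len(data.payload)


location = read_source("gps", "45.38,-75.69")
message = concat(Data("loc="), location)
network_send(message, "ads.example.com")
network_send(Data("hello"), "ads.example.com")
print(leak_log)
```

Only the first send is logged: the location was combined with other data before transmission, yet the marking survived the combination and was detected at the sink.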

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.[REF]
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[REF]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that they achieved information flow tracking, but that they made it efficient enough to run in real time on real, constrained hardware devices with minimal overhead. TaintDroid causes only roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead was measured on a "CPU-bound micro-benchmark", and that TaintDroid "imposes negligible overhead on interactive third-party applications." (Enck et al., YEAR, p1)
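
One natural way to hold 32 taint markings per data unit is a 32-bit vector with one bit per marking, so that combining two tags is a single bitwise OR. The Python sketch below assumes that encoding; the constant names and bit positions are made up for illustration and are not taken from the paper.

```python
# Hypothetical bit assignments, one bit per kind of sensitive data.
TAG_LOCATION = 1 << 0
TAG_PHONE_NUMBER = 1 << 1
TAG_DEVICE_ID = 1 << 2

MASK_32 = 0xFFFFFFFF  # keep tags within 32 bits


def combine(tag_a, tag_b):
    # Union of two sets of markings: one bitwise OR.
    return (tag_a | tag_b) & MASK_32


def has_marking(tag, marking):
    # True if the given marking bit is set in the tag.
    return (tag & marking) != 0


def markings(tag):
    # List the bit positions of all markings present in the tag.
    return [bit for bit in range(32) if tag & (1 << bit)]


tag = combine(TAG_LOCATION, TAG_DEVICE_ID)
print(has_marking(tag, TAG_LOCATION))
print(has_marking(tag, TAG_PHONE_NUMBER))
print(markings(tag))
```

With this encoding, propagation costs one integer operation per data operation and one extra 32-bit word per tracked unit, which fits the low overhead figures quoted above.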

This low overhead is achieved by modifying the code directly at the Java Virtual Machine (JVM) layer of the Android system to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they modified the inter-process communication (IPC) mechanism to provide message-level tracking, which allows them to monitor inter-process, that is, inter-application, communications. For native code invoked through the Java Native Interface (JNI), they use method-level tracking, which allows them to "patch the taint propagation on return" (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. [REF] Finally, by modifying the network interface and secondary storage interfaces, they are able to provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3).[REF]

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2 to 20 times slower than normal, which is not suitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces a serious problem: taint explosion. With some complex instructions, such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, or taint may be lost. [REF] TaintDroid addresses this problem by combining the four levels of tracking. For example, the variable level allows TaintDroid to provide flow semantics for taint propagation, making it possible to distinguish between different data pointers at different levels and so preserve accuracy.

By combining these four levels (variable, method, message, and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, 37 were clearly legitimate uses of the data.[REF] It also determined that 50% of the applications submitted the user's location to advertising servers, and that seven of the applications transmitted the user's device ID and, in some cases, the phone number and SIM card serial number. Clearly, the higher granularity is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

This paper has quite a bit of information, but has a very strong structure in explaining what TaintDroid is and what it does, which makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the history followed by an explanation of sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

The challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to get around the challenges, using a taint source as the targeted sensitive information and a taint marking to identify the information type. The impressive number of information leaks that were found shows that this research was effective: TaintDroid identifies a high percentage of information misuse. However, while the implementation is strong in that the overhead is low and the accuracy is high, trade-offs were made to meet that overhead.

To avoid additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures that TaintDroid could use to track implicit flow dynamically. It is presumed that branch structures are maintained at the kernel level, since a static analysis could uncover data leaks stemming from implicit data flow while a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage, which stem from the fact that most string objects have the same tag. Because of these shared tags, false positives can occur.

Furthermore, TaintDroid is a firmware modification, not an application, which raises questions about its usability for the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes. That is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27%, and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. Because TaintDroid is incorporated in the firmware, it incurs additional overhead whenever the user uses the phone. Compare this with the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 1] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment, which allows it to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its emulated environment. In other words, TaintCheck performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which you upload the application binary, after which TaintDroid reports whether the application is safe or not. This has the advantage of needing to run just once before installation, so the overheads are much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.

=Questions=
Possible exam questions and brief answers are listed below, along with the section that contains more information pertaining to each question.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures are part of the pre-compiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
[] DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243. 
[] CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009). 
[] NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005). 

[] CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04. 
[] HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.
[] SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs)]. http://pskl.us 
[] NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).

COMP 3000 Essay 2 2010 Question 8

2010-12-01T18:00:22Z

Sliske: /* Critique */

=TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones=
'''Authors:''' 
* William Enck, Patrick McDaniel ''The Pennsylvania State University'' 
* Peter Gilbert, Landon P. Cox ''Duke University'' 
* Byung-Gon Chun, Jaeyeon Jung, Anmol N. Sheth ''Intel Labs''

[http://appanalysis.org/tdroid10.pdf Direct Link]

[http://www.appanalysis.org/ Official Website]

[http://www.youtube.com/watch?v=qnLujX1Dw4Y Video Demonstration]

=Background Concepts=
To follow the ideas in this paper, the concepts that form its basis have to be understood. The following two concepts are central to understanding this paper: 
==Information Flow==
Information flow, as the name suggests, is the transfer of information. This transfer can take place between two processes or within a given process, for example, between variables. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Information Flow Theory tries to capture this flow of information in a mathematical model. 
In a security model the information flow can be categorized into: 
===Explicit Flow===
Explicit flow is when information subject to security classifications is transferred to a variable or process which is not subject to the same or higher level of security, causing a security breach. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] The breach occurs because information is now more visible than it was intended to be. An example of explicit flow is shown below:
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
notsecure = secure</big> </code>
 
The information in 'secure', which is PRIVATE, is transferred to 'notsecure', which is PUBLIC; this is an information leak.
 

===Implicit Flow===
Implicit flow is when information subject to security classifications is deduced indirectly. The leakage of information occurs through the program's control flow. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] Depending on the flow of the program, the secure information can be compromised, as shown below: 
 
<code>PRIVATE VAR <big>secure</big> 
PUBLIC VAR <big>notsecure 
if secure="blah blah" then: 
notsecure=1 
else: 
notsecure=0</big> </code>
 
Since we can determine the value of the information in 'secure' using logic statements, we can indirectly access the secure information. Information leaks due to implicit flows are much harder to detect and protect against, because of their indirect nature.

==Taint Analysis==

The basic premise of taint analysis is to follow the information flow of "tainted" variables to ensure that they do not create a security breach. Any variable that can be modified directly or indirectly by the user and can become a security vulnerability is "tainted". Through various operations the "taint" can be passed from variable to variable, propagating it. When a tainted variable is used to execute potentially dangerous commands a breach is logged, allowing detection of possible security concerns. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf] 

===Dynamic Taint Analysis===
Taint analysis done at run-time is called Dynamic Taint Analysis. The approach used in dynamic taint analysis is to label data originating from untrusted sources as tainted. The analysis keeps track of all the tainted data in memory, and when such data is used in a potentially dangerous situation, a leak is logged. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf] This approach can detect most input validation vulnerabilities with a very low false positive rate. However, the execution of the program is slower because of the additional checks being performed. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec] 

===Static Taint Analysis===
Static taint analysis is a technique that computes an over-approximation of the set of instructions that are influenced by user input. The set of tainted instructions is computed statically by analyzing the source code of the program. [REF] The main advantage of static taint analysis is that it takes into account all the possible execution paths of the program. On the other hand, it may not be as accurate as a dynamic analysis, because it does not have access to any run-time information about the program. [REF] 

===Mathematical Model===
<big><code>Let V = {T, U} be the set of labels, where T is tainted and U is untainted. 
Define <big>⊕</big>: V x V -> V such that, for x, y ∈ V: 
x<big>⊕</big>y = T, if x = T OR y = T 
x<big>⊕</big>y = U, if x = U AND y = U
</code></big> 

It is now easy to see that whenever a tainted variable is used by another variable, the variable that used the tainted variable becomes tainted as well; the taint is propagated. Taking this further we can see that, if needed, we can tag variables as tainted by attaching to them a tainted tag, which can then be tracked or used as wanted. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf] 

Note: The paper uses Dynamic Taint Analysis in an unusual way. In conventional taint analysis, "taint" marks untrusted information; TaintDroid instead taints variables that hold valuable private data and tracks their progress through the system. For more detailed information on taint analysis, refer to "Detecting Software Vulnerabilities Static Taint Analysis"[2].

=Research problem=

In today’s society, smartphones are a prominent new technology. Smartphones, by their nature, are linked into many private details of our lives, including not only classic data like our contact list, but new kinds of data smartphones make available, such as location data. Smartphones also have the ability to download and run third party applications which can connect to the internet; indeed, this is why we call them "smart". Except for the odd tunnel or elevator, these phones are constantly connected to the internet. When you combine third party applications with an internet connection on a device that stores an immense amount of personal data, you suddenly find yourself unsure of how your data is being used; what is to stop a third party application from disseminating our private information? As it turns out, very little. [http://news.bbc.co.uk/2/hi/technology/8559683.stm]

A telling example of this is a wallpaper application that sends your phone number back to the developer.[http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf] Once the app is running on your phone, it can typically access any of the information on your phone that it has been given permission to access, and it is not necessarily clear when the application has accessed data, or what it is doing with it.

The authors of this paper set out to try to understand what kind of information is being collected and where that information is being sent, and in order to do that, they first needed to build a means of tracking that information.

The strategy they chose is called Dynamic Taint Analysis, sometimes called Taint Tracking. The basic idea is to mark (or ''taint'') sensitive information at its source, and then to follow that mark as the data moves through a system. In the context of this paper, if we ever see marked data leave the network interface of the phone, then we know that some sensitive information has been disseminated.

There are many difficulties associated with implementing such a system on a smartphone. Their design goals were to create a light-weight, minimal overhead, real-time tracking system that runs directly on a real phone, with real applications. To be really useful, the tracking system must not impact the user experience too heavily.

Some implementation difficulties are:
* Smartphones are resource constrained. Processing power and memory are limited, and any processing that we do perform will consume battery power. If the tracking system is to be real-time, the phone must still be considered "usable" by the end user, and so the system must be truly lightweight.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf]
* Third party applications arrive in a compiled format; we cannot analyze their source code.[REF]
* Applications may do complex things with the sensitive data. It is unlikely that the application will simply read a location from the GPS and dump it straight out over the network. More likely is that the application will use that data in some way, or combine it with other data, before it is sent. We need to be able to track sensitive data throughout this entire process if we hope to perform any useful analysis.[REF]
* Applications can share information with other applications, meaning that our tracking has to work across multiple processes.[REF]
* The tracking must operate on a real phone, not a simulated one. With a simulated system, where we control the virtual hardware and memory, we can be certain that we can see everything that an application might do. On a real device, how can we get low-level enough to see everything the applications do?[REF]

=Contribution=
The main contribution of the TaintDroid paper is not that the authors achieved information flow tracking, but that they made it efficient enough to run in real time on real, resource-constrained devices with minimal overhead. TaintDroid causes only roughly 14% CPU overhead and approximately 4.4% memory overhead when tracking 32 taint markings per tainted data unit. It should also be noted that the 14% CPU overhead applies only to a "CPU-bound micro-benchmark and imposes negligible overhead on interactive third-party applications."(Enck et al., YEAR, p1)
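
One way to picture 32 taint markings per data unit is a 32-bit tag in which each bit stands for one marking, so that combining two tags is a single bitwise OR. The sketch below assumes that encoding; the marking names and bit positions are made up for illustration.

```python
MASK_32 = 0xFFFFFFFF  # a tag is limited to 32 bits, one bit per marking

# Hypothetical marking names and bit positions, chosen for this example only.
MARKINGS = {
    "LOCATION": 1 << 0,
    "CONTACTS": 1 << 1,
    "PHONE_NUMBER": 1 << 2,
    "DEVICE_ID": 1 << 3,
}


def merge(*tags):
    """Union of taint tags: a bitwise OR of all inputs."""
    result = 0
    for tag in tags:
        result |= tag
    return result & MASK_32


def names(tag):
    """List the markings that are set in a tag."""
    return [name for name, bit in MARKINGS.items() if (tag & bit) != 0]


combined = merge(MARKINGS["LOCATION"], MARKINGS["DEVICE_ID"])
```

Because a tag is a single machine word under this assumption, propagating it costs one extra OR per operation, which is consistent with the low overhead the paper reports.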

This low overhead is achieved by modifying Android's Dalvik virtual machine, the Java-language VM layer of the system, to provide variable-level tracking. This allows direct control over how and what private information, such as location details from the GPS, is stored and accessed. [REF] Next, they added message-level tracking, which allows them to monitor inter-process, a.k.a. inter-application, communications. For native code reached through the Java Native Interface (JNI), they use method-level tracking: native methods run without instrumentation, and the authors "patch the taint propagation on return" (Enck et al., YEAR, p3) so they can keep track of information transfer via native code. [REF] Finally, they modified the secondary storage interface to provide file-level taint tracking, which enables them to ensure "persistent information conservatively retains its taint markings" (Enck et al., YEAR, p3), and the network interface, where tainted data leaving the phone is detected.[REF]
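
A rough sketch of what coarser message-level and file-level tracking could look like is given below, assuming each message or file keeps one tag for its whole contents. The classes <code>Message</code> and <code>TaggedFile</code> are hypothetical and are not Android APIs.

```python
LOCATION = 1 << 0  # hypothetical marking bit


class Message:
    """An inter-process message that carries one tag for its whole payload."""

    def __init__(self):
        self.fields = []
        self.tag = 0

    def write(self, value, tag=0):
        self.fields.append(value)
        self.tag |= tag  # the message tag is the union of all field tags

    def read_all(self):
        # Receiver side: every field comes out with the message-wide tag.
        return [(value, self.tag) for value in self.fields]


class TaggedFile:
    """A file that keeps one tag for everything ever written to it."""

    def __init__(self):
        self.chunks = []
        self.tag = 0

    def write(self, data, tag=0):
        self.chunks.append(data)
        self.tag |= tag  # conservative: the tag is never cleared

    def read(self):
        return "".join(self.chunks), self.tag


msg = Message()
msg.write("45.38,-75.69", LOCATION)  # sensitive field
msg.write("hello")                   # harmless field
received = msg.read_all()

f = TaggedFile()
f.write("notes;")
f.write("45.38,-75.69", LOCATION)
contents, file_tag = f.read()
```

The harmless field is reported as tainted on the receiving side. Under this assumption, that is the price of keeping one tag per message or file: less bookkeeping, at the cost of some false positives.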

Another contribution of TaintDroid is the accuracy with which it tracks sensitive data. Unlike existing solutions that rely on heavy-weight, whole-system emulation [REF], the virtualized architecture of Android allows four levels of taint propagation: variable, method, message, and file. The granularity and flow semantics that TaintDroid offers strongly influence its performance and accuracy. Existing taint tracking approaches, like the Panorama taint system, rely on instruction-level dynamic taint analysis using whole-system emulation. This method can make the system perform 2 to 20 times slower than normal, which is unsuitable for real-time analysis.[http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html][http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.3789&rep=rep1&type=pdf] Moreover, instruction-level tracking faces serious accuracy problems. With complex instructions such as CMPXCHG and REP MOV, the stack pointer may become falsely tainted, which leads to taint explosion, or taint may be lost. [REF] TaintDroid avoids this problem by combining its four levels of tracking. For example, variable-level tracking gives TaintDroid flow semantics for taint propagation, so it can distinguish between data pointers at different levels, which helps ensure accuracy.

By combining these four levels (variable, method, message and file) of taint tracking, TaintDroid was able to effectively track 30 randomly selected, popular, third-party Android applications. In doing so, TaintDroid correctly flagged 105 instances of tainted information transmission. Of these 105, only 37 were clearly legitimate uses of the data.[REF] It also determined that 50% of the applications submitted the user's location to advertising servers, and that seven of the applications transmitted the user's device ID and, in some cases, the phone number and SIM card serial number. Clearly, finer-grained visibility into how applications use private data is needed, and TaintDroid is a step in the right direction, providing a highly efficient real-time tracking system.

=Critique=

The paper is dense with information, but its strong structure in explaining what TaintDroid is and what it does makes it easy to read. The paper begins with a high-level overview of TaintDroid, then explains the background, followed by an explanation of the sources that are tracked by TaintDroid and its design. It continues with test results and the strengths and weaknesses of TaintDroid, with references to related work.

The challenges of monitoring network disclosure of privacy-sensitive information are well outlined, as are TaintDroid's workarounds for these challenges. TaintDroid uses dynamic taint analysis to work around the challenges, using a taint source as the targeted sensitive information, and a taint marking to identify the information type. It is easy to see that this research was effective, given the impressive number of information leaks that were found in the applications tested. However, while the implementation is strong in that the overhead is low and the accuracy is high, trade-offs were made to keep the overhead that low.

To prevent additional overhead, TaintDroid does not track implicit data flow or control flow. This is partly because the applications being tested are loaded onto the phone as black-box, precompiled binaries, but mostly because the Android JVM does not maintain branch structures, which TaintDroid would need in order to track implicit flow dynamically. It is presumed that branch structures are maintained at the kernel level; a static analysis could uncover data leaks stemming from implicit data flow, but a dynamic analysis such as TaintDroid cannot. This means that applications can bypass the taint analysis by using implicit flow. There are also other issues, particularly in taint tag storage, where an array, and therefore a string object, carries a single taint tag shared by all of its elements. Because the elements share one tag, false positives can occur.
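
The implicit-flow bypass can be made concrete with a small sketch. It models a tracker that propagates taint only through explicit data operations; the class <code>Tracked</code> and both copy functions are hypothetical, and the phone number is fictional.

```python
class Tracked:
    """A string with a taint bit that follows explicit data flow only."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted

    def __add__(self, other):
        # Explicit flow: the result is tainted if either operand is.
        return Tracked(self.value + other.value, self.tainted or other.tainted)


def explicit_copy(secret):
    """Concatenate the secret itself: the taint bit follows the data."""
    return Tracked("") + secret


def implicit_copy(secret):
    """Rebuild the secret digit by digit using only branches on it."""
    out = Tracked("")  # starts untainted
    for i in range(len(secret.value)):
        for digit in "0123456789":
            if secret.value[i] == digit:    # control depends on the secret
                out = out + Tracked(digit)  # but only constants are combined
    return out


phone = Tracked("6135550100", tainted=True)
direct = explicit_copy(phone)
leaked = implicit_copy(phone)
```

Both copies hold the same digits, but only <code>direct</code> is still marked. A tracker that ignores control flow would let <code>leaked</code> reach the network without raising an alert.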

Furthermore, TaintDroid is a firmware modification, not an application, which raises the question of its usability by the average user. Being a firmware modification drastically reduces its usability unless the Android system itself incorporates these changes, which is highly unlikely, as the overheads (a memory overhead of 4.4%, an IPC overhead of 27% and an overall overhead of 14%) are on the high side for an already resource-constrained smartphone.

Consider a possible alternative implementation of TaintDroid. TaintDroid is incorporated in the firmware and hence incurs additional overhead whenever the user uses the phone. Compare this with the implementation of 'TaintCheck' on an x86 platform.[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf 13] TaintCheck performs dynamic taint analysis on a program by running the program in its own emulation environment. This allows TaintCheck to monitor and control the program’s execution at a fine-grained level. All TaintCheck needs is the binary, which it then rewrites and runs in its own emulated environment. In other words, TaintCheck is a mechanism that performs dynamic taint analysis through binary rewriting at run time in an emulated environment. Taking this further, we can consider an implementation of TaintDroid along similar lines. One can envision a service to which the application binary is uploaded, after which TaintDroid returns a verdict on whether the application is safe or not. This has the advantage of needing to run just once, before installation, so the overheads would be much less of a concern. It could even allow TaintDroid to incorporate signature-based detection of malicious applications.
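
As a rough sketch of this "analyse once, before installation" idea, the check could be keyed on a hash of the binary so that each application is analysed a single time and known-bad binaries are recognised by signature. Everything here is hypothetical: <code>analyse</code> stands in for running the binary under an emulated taint tracker, and no such service exists in the paper.

```python
import hashlib

KNOWN_BAD = set()  # hypothetical signature database: SHA-256 of flagged binaries
_verdicts = {}     # digest -> verdict, so each binary is analysed only once


def fingerprint(binary):
    """Signature of an application binary (its SHA-256 digest)."""
    return hashlib.sha256(binary).hexdigest()


def check_application(binary, analyse):
    """Return "safe" or "unsafe" for a binary.

    `analyse` is a stand-in for the expensive emulated taint analysis; it
    takes the binary and returns True if tainted data was seen leaving.
    """
    digest = fingerprint(binary)
    if digest in KNOWN_BAD:          # signature match: no analysis needed
        return "unsafe"
    if digest not in _verdicts:      # first time this binary is seen
        leaked = analyse(binary)
        _verdicts[digest] = "unsafe" if leaked else "safe"
        if leaked:
            KNOWN_BAD.add(digest)
    return _verdicts[digest]
```

Under this design the cost of the analysis is paid once per binary rather than continuously on the phone.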

=Questions=
Possible exam questions and brief answers are listed below, along with the section where more information pertaining to each question can be found.

===Anil's questions===
* What is one source of false positives in TaintDroid? (In other words, what kind of code/data behavior leads to false alarms?)
** Some applications are making legitimate use of sensitive data; for example, Google Maps needs to know your location in order to work, and the use of this data is known by the user. TaintDroid cannot know whether the user has consented to the use of some data, and so flags it as a leak. (Critique - Content)
* What part of Android was modified for TaintDroid? Is this part of Android's kernel? Explain briefly.
** The Dalvik VM is modified. Dalvik is the Java virtual machine used by Android to run user applications. Although Dalvik is a core part of the Android operating system, it is not a part of the kernel. Dalvik runs on top of the Android kernel as a user process. All third-party user applications run on top of Dalvik, however, so it is a sufficiently "low-level" point of the system at which to implement taint tracking. (Contribution)

===Additional questions===
* Although TaintDroid is adept at catching information leaks, there are ways an application can bypass the TaintDroid filter. Describe one.
** An application could use implicit flow to derive data from tainted objects. TaintDroid has no way to inspect implicit flow dynamically, because no branch structures are maintained at the JVM layer, where TaintDroid is implemented. Instead, control structures are part of the precompiled application binaries, which, while not entirely black boxes, are impractical to investigate dynamically. Implicit flow data leaks can, however, be caught by a static analysis [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf].

=References=
* DENNING, D. E. [http://www.cs.georgetown.edu/~denning/infosec/lattice76.pdf A Lattice Model of Secure Information Flow]. Communications of the ACM 19, 5 (May 1976), 236–243.
* CEARA, D., POTET, M. L., et al. [http://tanalysis.googlecode.com/files/DumitruCeara_BSc.pdf Detecting Software Vulnerabilities Static Taint Analysis]. GINP ENSIMAG, GoogleCode (2009).
* NEWSOME, J., AND SONG, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.2141&rep=rep1&type=pdf Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software]. Proceedings of the Network and Distributed System Security Symposium (NDSS 2005).
* CHOW, J., PFAFF, B., GARFINKEL, T., CHRISTOPHER, K., AND ROSENBLUM, M. [http://www.usenix.org/events/sec04/tech/chow/chow_html/index.html#sec:future Understanding Data Lifetime via Whole System Simulation]. USENIX Security '04.
* HALDAR, V., CHANDRA, D., AND FRANZ, M. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.3118&rep=rep1&type=pdf Dynamic Taint Propagation for Java]. University of California.
* SMITH, E. [http://www.pskl.us/wp/wp-content/uploads/2010/09/iPhone-Applications-Privacy-Issues.pdf iPhone Applications & Privacy Issues: An Analysis of Application Transmission of iPhone Unique Device Identifiers (UDIDs)]. http://pskl.us
* NAIR, S. K., SIMPSON, P. N., CRISPO, B., AND TANENBAUM, A. S. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.2676&rep=rep1&type=pdf A Virtual Machine Based Information Flow Control System for Policy Enforcement]. International Workshop on Run Time Enforcement for Mobile and Distributed Systems (REM 2007).