Soma-notes - User contributions [en]

Category:2011-O&C

2011-04-13T06:11:31Z

Hadi sajjadpour: Added new lines between numbers ( result of copy pasting from the PDF)

Please note that the majority of our efforts are contained on the "Discussion" page.
* [[DistOS-2011W Contracts and Observability Old First Page|Old First Page]]

This report was done by Tarjit Komal, Andrew Luczak, Scott Lyons and Seyyed Sajjadpour.

= Introduction =

This paper is an overview of a theoretical implementation of electronic contract negotiation
systems. We begin by exploring the previous work in the field of electronic contract resolution, and
we then outline the framework for QUORUM, a proposed system of electronic contract negotiation
and mediation with an integrated reputation system.

== Focus Of Study ==

The primary goal of this report is to provide some mechanism for a reliable and automated
contract negotiation framework, one that will ideally function over a distributed system.
As a secondary goal, we discuss a mechanism for observing these contracts; such a mechanism is
critical for determining when a contract has been properly fulfilled, which we show is a
requirement for the repeatable success of a contract negotiation system.

= Background =

== Automated Contracts ==

We define an automated contract as any contract between computer systems requiring a
minimum of human intervention. Some human effort may be required to define the guidelines by
which the system negotiates (e.g. preferring a reliable system with longer wait times over
a less reliable system with shorter wait times), but the system should be able to operate
autonomously for the entire contract period.

== Assumptions ==

In order to simplify the problem at hand, we make the following assumptions:

1) All participants in a contract (i.e. the client and the provider) are automated systems
(either a single machine, or a group of machines).

2) All machines have a unique universal identifier which is un-spoofable.

3) Any violation of contract parameters can be handled with perfect efficiency and
judicial accuracy.

4) Machines have no mobility, in either a physical or network context.

We make these assumptions recognizing that the current implementation of the Internet
and other large systems makes such a situation impossible (particularly the second assumption,
wherein identifiers are un-spoofable and universal).

== Related Work & Project Basis ==

The field of electronic contracts has been approached in several different ways, with papers
describing several avenues of research; however, no paper we discovered provided a complete
system of electronic contract negotiation and validation. Based on what we’ve found in research
literature, Service Level Agreements (SLA) are the closest to our proposed system, and were the
basis for our framework.

Using the groundwork laid by the following research groups, we look at the problem at a higher
level, and focus on an arbitrary network (e.g. a WAN or LAN) that sees computers as citizens.
Citizens of this network will get the chance to observe contracts and act as witnesses for other
citizens. We will show how SLA-style contract parameters can be used as the benchmarks of a
contract verification system, how a reputation system can increase the overall reliability of the
contract system, and how a gossiping system can provide an effective propagation of system
capabilities.

=== SLA ===

Verma, in his paper [1], defines SLA to be “…a formal relationship that exists between a
service provider and its customer”. He also mentions some key components of SLAs such as: “a
description of the nature of service to be provided”; “the expected performance level of the service,
specifically its reliability and responsiveness”; and “the procedure for reporting problems with the
service”. These are genuine concerns for any electronic contract. In our system, we adapt Verma’s
defined components to be used by each contracted party to verify whether a contract has been
fulfilled or not. Verma also discusses various approaches to guaranteeing service, including an
insurance approach and an adaptive approach, which can be used by service providers.

=== Reputation ===

Groth et al., in their research [2] on the trustworthiness of contracts, propose two key
indicators of a contract’s potential reliability: the history of the contracting party, and the similarity
of the contract to other successful contracts. Groth et al. also lay out an effective template for
contracts (based in turn upon the IST-CONTRACT project) which we use and build upon in our
QUORUM system.

=== Gossiping ===

In a large network, disseminating information through broadcast updates between
everyone in the network will create bandwidth, latency and denial-of-service concerns. Hence,
there needs to be some other way to disseminate information that avoids these
concerns. In their paper [3], Kermarrec et al. define gossiping in distributed systems to be “…the
repeated probabilistic exchange of information between two members.” In other words, gossiping
is the random dissemination of information, with timeouts and periods chosen to suit the
size of the network. In most cases, gossiping would include exchanging lists of information with
randomly selected nodes.

In principle, most gossiping algorithms follow the framework provided by Kermarrec et al.:

1) Peer (Node) Selection: Selecting a few random nodes in the network.

2) Data Exchanged: Selecting what information to send to the selected nodes.

3) Data Processing: Once data has been received via gossiping, process that data and
decide what action to take.
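The three steps above can be sketched as one round of push-style gossip. This is a minimal illustration rather than the algorithm of Kermarrec et al.: the function names, the `fanout` parameter, and the choice to send a node's entire view in each message are all assumptions made for the example.

```python
import random


def gossip_round(views, fanout, rng):
    """Run one push-style gossip round.

    `views` maps a node id to the dict of facts that node currently knows.
    All names here are hypothetical; they do not come from the cited paper.
    """
    node_ids = sorted(views)
    # Snapshot what each node knew at the start of the round.
    outgoing = {n: dict(views[n]) for n in node_ids}
    for sender in node_ids:
        # 1) Peer selection: pick a few random nodes other than the sender.
        candidates = [n for n in node_ids if n != sender]
        peers = rng.sample(candidates, min(fanout, len(candidates)))
        for peer in peers:
            # 2) Data exchanged: the sender's whole view (an assumption).
            message = outgoing[sender]
            # 3) Data processing: the receiver keeps facts it has not seen.
            for key, value in message.items():
                views[peer].setdefault(key, value)
    return views


def rounds_until_everyone_knows(n_nodes, fanout, seed, max_rounds=1000):
    """Count rounds until a rumor started at node 0 has reached every node."""
    rng = random.Random(seed)
    views = {i: {} for i in range(n_nodes)}
    views[0]["rumor"] = "node 0 offers bandwidth"
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, fanout, rng)
        if all("rumor" in view for view in views.values()):
            return round_no
    return None
```

Calling `rounds_until_everyone_knows(20, 2, seed=1)` gives a rough feel for how quickly a single fact spreads; the exact count depends on the seed and on the fanout.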

Gossiping is a common method of dispersing information in distributed systems, and is
inspired by real-life gossiping. Take, for example, the students in a university department as nodes
in a network. Once a new student joins the department, he establishes a friendship with a few
other students (nodes) in the department (network). Subsequently, these students may or may not
inform others in the department that the new student exists, disclosing the characteristics of the
new student. This gossiping continues, and at the end of a given time period each student has some
information about some of the other students in the department; it is highly unlikely that any one
student has information regarding every other student in the department, but every student is
known by some other student in the department.

== Case Study ==

The simplest and most common use case is that of a website suffering from a Denial of
Service attack or an unexpected traffic influx. In this example and throughout the rest of this report,
ForeverAlone.com is the Client and TooMuchBandwidth.net is the Provider. Some automated
process or monitor on the Client would detect the sudden surge in traffic and would
contact the Provider to request its services. The request process is shown in more detail
in the timeline below.

[[File:Contract_timeline.png]]

= Contracts =

== Contract Template ==
Groth et al. provide an excellent template for a contract, in the form of a contract schema:

[[File:OC-2011_Table1.png]]

We recognize that this template is simplistic in nature, but it provides an adequate basis for
discussion of electronic contracts; any implementation of such a template would need a more
detailed structure and syntax. It is worth noting, however, that the IST-CONTRACT group, upon
whose work Groth et al. based their template, explored the idea of structuring a contract in an
XML-style wrapper, which seems a logical progression from the above template.
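As a rough illustration of what an XML-style wrapper might look like, the sketch below serializes a contract record with Python's standard `xml.etree.ElementTree`. The field and element names are assumptions drawn from terms used elsewhere in this report (contract identifier, activating, normative and expiration conditions); they are not a reproduction of the schema of Groth et al. or of the IST-CONTRACT format.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Contract:
    # Hypothetical fields; the real template may differ.
    contract_id: str
    client: str
    provider: str
    activating_condition: str
    normative_conditions: list = field(default_factory=list)
    expiration_condition: str = ""

    def to_xml(self):
        """Serialize the contract into an XML string (element names assumed)."""
        root = ET.Element("contract", {"id": self.contract_id})
        ET.SubElement(root, "client").text = self.client
        ET.SubElement(root, "provider").text = self.provider
        ET.SubElement(root, "activatingCondition").text = self.activating_condition
        norms = ET.SubElement(root, "normativeConditions")
        for condition in self.normative_conditions:
            ET.SubElement(norms, "condition").text = condition
        ET.SubElement(root, "expirationCondition").text = self.expiration_condition
        return ET.tostring(root, encoding="unicode")


example = Contract(
    contract_id="c-001",
    client="ForeverAlone.com",
    provider="TooMuchBandwidth.net",
    activating_condition="client traffic exceeds its capacity",
    normative_conditions=["provider serves overflow requests",
                          "client pays the agreed rate"],
    expiration_condition="traffic returns to normal",
)
xml_text = example.to_xml()
```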

== Contract Formation ==

The formation of a contract has three general steps:

1) The Client issues a Request For Proposal (RFP)

2) A Provider replies to the Client's RFP with a proposal

3) Both parties agree to terms

Once these three steps have been followed, the contract is set and, assuming the activating
condition is met, the normative conditions are put into force.

There exists a problem, however, in enforcing the formation of a contract: some mechanism
for verifying the origins of the contract is necessary to prevent the forging of contracts for the
purposes of self-promotion. This becomes particularly critical when a reputation system is in effect;
the ability to forge a completed contract would be an incredible advantage for a service provider
wishing to boost its reputation. (Reputation is a key component of the QUORUM system, which we
discuss in a later section of this paper.)

We propose the following procedure for forming a contract:

1) Contract agreement proceeds as above

2) Once the contract has been ratified by the Client and Provider, each provides a copy of
the contract template to its neighbors

3) Each neighbor signs the contract as a witness, and propagates the contract across the
network as a whole.

This system will produce a group of multiple contracts, each signed by a different witness,
but all possessing the same contract identifier (since all copies share the same origin). These
contracts can later be collected into a single copy, with a list of all witnesses. Contracts can then be
verified based on the presence or absence of certain systems from the witness list. This observation
system is built upon a gossiping network protocol, which we describe in more detail later in this
paper.
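The collection and verification steps described above might look like the following sketch. A real system would attach a cryptographic signature to each copy; here a witness is recorded only by its identifier, and the acceptance rule (a set of required witnesses plus a minimum witness count) is an assumption, since the report leaves the exact rule open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WitnessedCopy:
    # One copy of a contract, signed by a single witness (signature omitted).
    contract_id: str
    witness: str


def collect_witnesses(copies):
    """Merge copies that share a contract identifier into one witness set each."""
    merged = {}
    for copy in copies:
        merged.setdefault(copy.contract_id, set()).add(copy.witness)
    return merged


def is_plausible(witnesses, required, min_witnesses):
    """Hypothetical check: required witnesses present and enough witnesses overall."""
    return required <= witnesses and len(witnesses) >= min_witnesses


copies = [
    WitnessedCopy("c-001", "node-a"),
    WitnessedCopy("c-001", "node-b"),
    WitnessedCopy("c-001", "node-b"),  # a duplicate copy from the same witness
    WitnessedCopy("c-002", "node-c"),
]
witness_lists = collect_witnesses(copies)
```

Under these assumptions, `is_plausible(witness_lists["c-001"], {"node-a"}, 2)` accepts the first contract, while the second contract, seen by a single witness, would be rejected by the same rule.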

== Contract Publication ==

In order to mitigate fraudulent reputation growth, the proposed witnessing mechanism
requires that the contract be propagated through the network. While only a minimum level of detail
needs to be shared in order to witness a contract, we recognize the need for a private option, where
two systems may form a contract without any details becoming public knowledge. The QUORUM
system provides such an option, though any contract created in private cannot affect reputation; a
contract must be witnessed to affect reputation, and a contract must be published to be witnessed.

= QUORUM =

QUORUM, or Quantifiable Uniform Observation and Reporting of Unmanned Mediation, is
our proposed system for electronic contract negotiation and observation on a distributed system.
Ideally, QUORUM runs on every machine in the network, acting as a distributed cloud of
autonomous observers. In addition to the observers, QUORUM also includes a reputation
system, which is built from the history of contracts on the network.

The observers of QUORUM have a single responsibility. When a contract is published, each
observer attempts to verify that the normative conditions are being or can be satisfied. Once the
expiration condition is reached, each observer reports back to the reputation system what
it believes the final state of the contract to be. These observers must operate upon some metric
which is appropriate to the contract; metrics for the observers may be a simple binary condition
(e.g. has system A provided a piece of information to system B) or a more detailed requirement.

== QUORUM Reputation ==

The QUORUM observers report back on a contract in one of three ways:

[[File:OC-2011_Table2.png]]

The reputation system tracks, for each contracting system on the network, a reputation
score based upon the published contract history of that party. Successful contracts increase
reputation score, while breached contracts decrease reputation score. Contracts that are voided by
both parties or negotiated in private (i.e. without witnesses or QUORUM observation) have a
neutral effect on reputation score (though such contracts will still appear in the contract history).
This reputation system can then be used by Clients to determine which Provider proposal to accept.
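The scoring rules in this section can be sketched as a small ledger. The outcome labels (`fulfilled`, `breached`, `voided`) and the unit step sizes are assumptions made for the example; the report does not fix the magnitude of a reputation change.

```python
from collections import defaultdict

# Hypothetical outcome labels and score changes.
SCORE_DELTA = {"fulfilled": 1, "breached": -1, "voided": 0}


class ReputationLedger:
    def __init__(self):
        self.scores = defaultdict(int)
        self.history = defaultdict(list)

    def record(self, party, contract_id, outcome, witnessed=True):
        """Log a contract; only witnessed contracts can change the score."""
        delta = SCORE_DELTA[outcome] if witnessed else 0
        self.scores[party] += delta
        # Private and voided contracts still appear in the history.
        self.history[party].append((contract_id, outcome, witnessed))

    def best_provider(self, candidates):
        """Pick the candidate with the highest score (first one wins ties)."""
        return max(candidates, key=lambda party: self.scores[party])


ledger = ReputationLedger()
ledger.record("TooMuchBandwidth.net", "c-001", "fulfilled")
ledger.record("TooMuchBandwidth.net", "c-002", "fulfilled")
ledger.record("SlowHost.example", "c-003", "breached")
ledger.record("SlowHost.example", "c-004", "fulfilled", witnessed=False)
```

Here `SlowHost.example` is a made-up second provider. Its private contract `c-004` is kept in the history but leaves its score unchanged, which mirrors the neutral effect described above.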

== QUORUM Gossip ==

In order to facilitate the formation of contracts, it is useful to have a regularly updated
notion of which systems have certain capabilities. For example, if a particular system is in need of
processing time, it needs to be able to quickly determine which providers can offer assistance. To
this end, QUORUM requires an inter-system communication network, and so we propose the
following gossiping system.

=== Gossiping in QUORUM ===

As QUORUM operates, there are four scenarios which involve gossiping: the entrance of a
node, the location of available services, the exchange of reputation information, and the detection of
network failure. We will address each of these separately. Various gossiping algorithms exist [4],
any of which is sufficient for the implementation of QUORUM. Given our current system, we have
derived a gossiping method that utilizes some components of existing methods, such as the
framework in [3].

Each node will have a list L of other nodes; each entry in this list records the identity of a
node and the services it is known to provide. A node will update entries in L based upon the
information it receives through gossip messages.

=== Entrance ===

When a node joins a network, it will have the address of two existing nodes on the network.
After it joins the network, the two nodes will then randomly assign neighbors to the new node,
selecting them from the QUORUM via some algorithm. Depending on the size of the QUORUM, each
node will have a fixed number h of neighbors; the number of neighbors and how randomly these
neighbors are chosen are areas which need to be investigated further.

The joining node also sends to its neighbors a list of services that it can provide. Once the
neighboring nodes have identified the incoming node and received its list of services, they
can then use a gossiping algorithm of the QUORUM to propagate this information. In turn, the
neighbors send their lists to the joining node, making it aware of the various capabilities of the
network it has joined.
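A single-process sketch of the entrance procedure follows. The data layout (a dict per node holding `services`, `neighbors` and a `known` list standing in for L), the uniform random choice of neighbors, and the decision to include each neighbor's own services in what it sends back are all assumptions; the report leaves the selection algorithm open.

```python
import random


def join(network, new_node, services, bootstrap_nodes, h, rng):
    """Add `new_node` to `network` with up to `h` randomly chosen neighbors.

    `network` maps a node id to a dict with keys "services", "neighbors"
    and "known" (the list L). All names here are hypothetical.
    """
    for node in bootstrap_nodes:
        if node not in network:
            raise ValueError(f"unknown bootstrap node: {node}")
    # The bootstrap nodes assign neighbors; here, a uniform random sample.
    candidates = sorted(n for n in network if n != new_node)
    neighbors = rng.sample(candidates, min(h, len(candidates)))
    entry = {"services": set(services), "neighbors": set(neighbors), "known": {}}
    for n in neighbors:
        # The joining node tells each neighbor which services it offers.
        network[n]["neighbors"].add(new_node)
        network[n]["known"][new_node] = set(services)
        # In turn, the neighbor shares its own list (and its own services).
        for other, offered in network[n]["known"].items():
            if other != new_node:
                entry["known"][other] = set(offered)
        entry["known"][n] = set(network[n]["services"])
    network[new_node] = entry
    return entry


network = {
    f"n{i}": {"services": {f"service-{i}"}, "neighbors": set(), "known": {}}
    for i in range(5)
}
joined = join(network, "new", {"bandwidth"}, ["n0", "n1"], h=3,
              rng=random.Random(0))
```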

=== Search For Services ===

As we saw in the related work, it is highly unlikely (depending on the size of the QUORUM)
for a node to have knowledge of every other node in the system. A given node that is looking for a
particular service x will first look in its own list L (which has been updated via gossiping
messages); if it cannot find service x with the required reputation, it will then ask around the
QUORUM to find the service it is looking for.
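The lookup described above could be sketched as follows. "Asking around" is modeled here as consulting the lists held by the node's neighbors, which is an assumption, as are the function name and the representation of a list entry as a pair of (services, reputation).

```python
def find_provider(service, min_reputation, local_list, neighbor_lists):
    """Return the best-reputed node offering `service`, or None.

    Each list maps a node id to a (set_of_services, reputation) pair.
    The local list L is searched first; neighbors' lists are a fallback.
    """
    def search(entries):
        matches = [
            (reputation, node)
            for node, (services, reputation) in entries.items()
            if service in services and reputation >= min_reputation
        ]
        # Highest reputation wins; ties fall back to the larger node id.
        return max(matches)[1] if matches else None

    found = search(local_list)
    if found is not None:
        return found
    for neighbor_list in neighbor_lists:
        found = search(neighbor_list)
        if found is not None:
            return found
    return None


local_list = {"n1": ({"storage"}, 5), "n2": ({"bandwidth"}, 1)}
neighbor_lists = [{"n3": ({"bandwidth"}, 4)}]
```

With these example lists, a search for `bandwidth` with a minimum reputation of 3 skips the low-reputation local entry `n2` and falls through to `n3`, which is known only to a neighbor.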

=== Failure Detection ===

Failure detection is a prominent sub-problem of distributed systems,
and has been approached by several research groups (see [6-9], as surveyed in [5]). Below we present
an aggregation of these efforts, applied to the QUORUM.

Given the neighbor system of QUORUM, each node can expect a gossip message from each of its
neighbors within a period of t seconds. If a neighbor of a node r fails to send such a message, then r will
send a heartbeat message to that neighbor (i.e. the pull model [7,8]) asking whether it is alive.
If r then receives a response, it knows that its neighbor is alive. If r does not receive a response
(after enough heartbeat messages to be convinced that its neighbor is dead), then it will remove
that node from the list of its neighbors, look for another neighbor, and communicate the node
failure (via gossiping messages) to the other members of the QUORUM. Given that this method
relies heavily upon the neighbors of a node, it might be prudent to implement an additional failure
detection method, such as that found in the work of Hayashibara et al. [9].
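The timeout-then-probe behavior described above can be sketched as a small monitor kept by node r. The class name, the `max_probes` threshold, and the `probe` callback (which stands in for sending a heartbeat and waiting for a reply) are assumptions; finding a replacement neighbor and gossiping the failure are left to the caller.

```python
class NeighborMonitor:
    """Track when each neighbor was last heard from (names are hypothetical)."""

    def __init__(self, gossip_period, max_probes):
        self.gossip_period = gossip_period  # the period t from the text
        self.max_probes = max_probes        # unanswered probes before giving up
        self.last_heard = {}
        self.missed_probes = {}

    def add_neighbor(self, node, now):
        self.last_heard[node] = now
        self.missed_probes[node] = 0

    def on_message(self, node, now):
        """Record a gossip message or heartbeat reply from a neighbor."""
        if node in self.last_heard:
            self.last_heard[node] = now
            self.missed_probes[node] = 0

    def tick(self, now, probe):
        """Probe silent neighbors; return those now considered failed.

        `probe(node)` sends a heartbeat and returns True if it was answered.
        """
        failed = []
        for node in list(self.last_heard):
            if now - self.last_heard[node] <= self.gossip_period:
                continue  # heard from this neighbor recently enough
            if probe(node):
                self.on_message(node, now)
                continue
            self.missed_probes[node] += 1
            if self.missed_probes[node] >= self.max_probes:
                # Remove the neighbor; the caller gossips the failure.
                del self.last_heard[node]
                del self.missed_probes[node]
                failed.append(node)
        return failed
```

A caller would invoke `tick` periodically with the current time. Requiring several unanswered probes before removal is one simple way to avoid declaring a neighbor dead after a single lost message.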

= Future Work And Conclusion =

Moving forward with QUORUM, one main objective would be to attempt an actual
implementation of the system according to our specifications. In our research we made several
assumptions, one of which was that computers are not mobile. There could be further research
done in examining and modifying our QUORUM to allow for system mobility.

= References =

[1] D.C. Verma, Service Level Agreements on IP Networks, Proceedings of the IEEE, vol. 92,
pp. 1382-1388, September 2004.

[2] P. Groth, S. Miles, S. Modgil, N. Oren, M. Luck, Y. Gil, Determining the Trustworthiness of
New Electronic Contracts, Engineering Societies in the Agents World X, 2009.

[3] A. Kermarrec, M. van Steen, Gossiping in Distributed Systems, SIGOPS Oper. Syst. Rev., 41(5):2-7,
2007.

[4] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. “Randomized Gossip Algorithms.” IEEE
Transactions on Information Theory, 52(6):2508-2530, June 2006

[5] S. Sajjadpour, Failure Detection in Distributed Systems, Distributed Operating Systems course,
Carleton University, Winter 2011

[6] R. Van Renesse, Y. Minsky, M. Hayden, A gossip-style failure detection service. In Proceedings of
the IFIP International Conference on Distributed Systems Platforms and Open Distributed
Processing, 1998. The version used here is from 2007.

[7] P. Felber, X. Defago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In
Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The
version I used here is from 2002.

[8] N. Hayashibara, A. Cherif, T. Katayama, Failure Detectors for Large-Scale Distributed Systems, In
21st IEEE Symposium of Reliable Distributed Systems (SRDS’ 02). 2002

[9] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings
of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’ 04),
pages 55-75, 2004

Category:2011-O&C

2011-04-13T00:52:05Z

Hadi sajjadpour:

Please note that the majority of our efforts are contained on the "Discussion" page.
* [[DistOS-2011W Contracts and Observability Old First Page|Old First Page]]

This report was done by Tarjit Komal, Andrew Luczak, Scott Lyons and Seyyed Sajjadpour.

= Introduction =

This paper is an overview of a theoretical implementation of electronic contract negotiation
systems. We begin by exploring the previous work in the field of electronic contract resolution, and
we then outline the framework for QUORUM, a proposed system of electronic contract negotiation
and mediation with an integrated reputation system.

== Focus Of Study ==

The primary goal of this report is to provide some mechanism for a reliable and automated
contract negotiation framework, which system will ideally be functional over a distributed system.
As a secondary goal, we discuss a mechanism for observing these contracts; such a mechanism is
critical for determining when a contract has been properly fulfilled, which we show is a
requirement for the repeatable success of a contract negotiation system.

= Background =

== Automated Contracts ==

We define an automated contract as any contract between computer systems requiring a
minimum of human intervention. Some human effort may be required to define the guidelines by
which the system negotiates (i.e. contract with a reliable system with longer wait times as opposed
to a less-reliable system with shorter wait times), but the system should be able operate
autonomously for the entire contract period.

== Assumptions ==

In order to simplify the problem at hand, we make the following assumptions:

1) All participants in a contract (i.e. the client and the provider), are automated systems
(either a single machine, or a group of machines).
2) All machines have a unique universal identifier which is un-spoofable.
3) Any conduct violations arising from a violation of contract parameters can be handled
with perfect efficiency and judicial accuracy.
4) Machines have no mobility, in either a physical or network context.

We make these assumptions recognizing that the current implementation of the Internet
and other large systems make such a situation impossible (particularly the second assumption
wherein identifiers are un-spoofable and universal.

== Related Work & Project Basis ==

The field of electronic contracts has been approached in several different ways, with papers
describing several avenues of research; however, no paper we discovered provided a complete
system of electronic contract negotiation and validation. Based on what we’ve found in research
literature, Service Level Agreements (SLA) are the closest to our proposed system, and were the
basis for our framework.

Using the groundwork laid by the following research groups, we look at the problem at a higher
level, and focus on an arbitrary network (i.e. a WAN or LAN) that sees computers as citizens.
Citizens of this network will get the chance to observe contracts and act as witnesses for other
citizens. We will show how SLA-style contract parameters can be used as the benchmarks of a
contract verification system, how a reputation system can increase the overall reliability of the
contract system, and how a gossiping system can provide an effective propagation of system
capabilities.

=== SLA ===

Verma, in his paper [1], defines SLA to be “…a formal relationship that exists between a
service provider and its customer”. He also mentions some key components of SLAs such as: “a
description of the nature of service to be provided”; “the expected performance level of the service,
specifically its reliability and responsiveness”; and “the procedure for reporting problems with the
service”. These are authentic concerns for any electronic contract. In our system, we adapt Verma’s
defined components to be used by each contracted party to verify whether a contract has been
fulfilled or not. Verma also discusses various approaches to guaranteeing service, including an
insurance approach and an adaptive approach, which can be used by service providers.

=== Reputation ===

Groth et al., in their research [2] on the trustworthiness of contracts, propose two key
indicators of a contract’s potential reliability: the history of the contracting party, and the similarity
of the contract to other successful contracts. Groth et al. also lay out an effective template for
contracts (based in turn upon the IST-CONTRACT project) which we use and build upon in our
QUORUM system.

=== Gossiping ===

In a large network, disseminating information through broadcast updates between
everyone in the network will create bandwidth, latency and denial-of-service concerns. Hence,
there needs to be some other ways to disseminate information, while avoiding the above mentioned
concerns. In their paper[3], Kermarrec et al. define gossiping in distributed systems to be “…the
repeated probabilistic exchange of information between two members.” In other words, gossiping
is the random dissemination of information, with correct timeouts and periods (depending on the
size of the network). In most cases, gossiping would include exchanging lists of information with
randomly selected nodes.

In principle, most gossiping algorithms follow the framework provided by Kermarrec et al.:

1) Peer (Node) Selection: Selecting a few random nodes in the network.
2) Data Exchanged: Selecting what information to send to the selected nodes.
3) Data Processing: Once data has been received via gossiping, process that data and
decide what action to take.

Gossiping is a common method of dispersing information in distributed systems, and is
inspired by real life gossiping. Take, for example, the students in a university department as nodes
in your network. Once a new student joins the department, he establishes a friendship with a few
other students (nodes) in the department (network). Subsequently, these students may or may not
inform others in the department that the new student exists, disclosing the characteristics of the

new student. This gossiping continues, and at the end of a given time period each student has some
information about some of the other students in the department; it is highly unlikely that any one
student has information regarding every other student in the department, but every student is
known by some other student in the department.

== Case Study ==

The simplest and most common use case is that of a website suffering from a Denial of
Service attack or an unexpected traffic influx. In this example and for all future reference,
ForeverAlone.com is the Client and TooMuchBandwidth.net is the Provider. Some automated
process or monitor on the Client would be advised of the sudden surge in traffic and would arrange
to contact the Provider to request their services. A more detailed look into the request process is
detailed below.

[[File:Contract_timeline.png]]

= Contracts =

== Contract Template ==
Groth et al. provide an excellent template for a contract, in the form of a contract schema:

ANDREW TABLE 1 GOES HERE!!!

We recognize that this template is simplistic in nature, but it provides an adequate basis for
discussion of electronic contracts; any implementation of such a template would need a more
detailed structure and syntax. It is worth noting, however, that the IST CONTRACT group, upon
whose work Groth et al. based their template, explored the idea of structuring a contract in an XML-
style wrapper, which seems a logical progression from the above template.

== Contract Formation ==

The formation of a contract has three general steps:

1) The Client issues a Request For Proposal (RFP)
2) A Provider replies to Client with RFP
3) Both parties agree to terms

Once these three steps have been followed, the contract is set and, assuming the activating
condition is met, the normative conditions are put into force.

There exists a problem, however, in enforcing the formation of a contract: some mechanism
for verifying the origins of the contract is necessary to prevent the forging of contracts for the
purposes of self-promotion. This becomes particularly critical when a reputation system is in effect;
the ability to forge a completed contract would be an incredible advantage for a service provider

wishing to boost its reputation (a key component of the QUORUM system, which we discuss in a
later section of this paper).

We propose the following procedure for forming a contract:

1) Contract agreement proceeds as above
2) Once the contract has been ratified by the Client and Provider, each provides a copy of
the contract template to its neighbors
3) Each neighbor signs the contract as a witness, and propagates the contract across the
network as a whole.

This system will produce a group of multiple contracts, each signed by a different witness,
but all possessing the same contract identifier (since all copies share the same origin). These
contracts can later be collected into a single copy, with a list of all witnesses. Contracts can then be
verified based on the presence or absence of certain systems from the witness list. This observation
system is built upon a gossiping network protocol, which we describe in more detail later in this
paper.

== Contract Publication ==

In order to mitigate fraudulent reputation growth, the proposed witnessing mechanism
requires that the contract be propagated through the network. While only a minimum level of detail
needs to be shared in order to witness a contract, we recognize the need for a private option, where
two systems may forge a contract without any details becoming public knowledge. The QUORUM
system provides such an option, though any contract created in private cannot affect reputation; a
contract must be witnessed to affect reputation, and a contract must be published to be witnessed.

= QUORUM =

QUORUM, or Quantifiable Uniform Observation and Reporting of Unmanned Mediation, is
our proposed system for electronic contract negotiation and observation on a distributed system.
Ideally, QUORUM runs on every machine in the network, acting as a distributed cloud of
autonomous observers. In addition to the observers, QUORUM is also comprised of a reputation
system, which is built from the history of contracts on the network.

The observers of QUORUM have a single responsibility. When a contract is published, each
observer attempts to verify that the normative conditions are being or can be satisfied. Once the
expiration condition is reached, each observer reports back to reputation system according to what
they believe the final state of the contract is. These observers must operate upon some metric
which is appropriate to the contract; metrics for the observers may be a simple binary condition
(e.g. has system A provided a piece of information to system B) or a more detailed requirement.

== QUORUM Reputation ==

The QUORUM observers report back on a contract in one of three ways:

ANDREW TABLE 2 GOES HERE!!

The reputation system tracks, for each contracting system on the network, a reputation
score based upon the published contract history of that party. Successful contracts increase
reputation score, while breached contracts decrease reputation score. Contracts that are voided by
both parties or negotiated in private (i.e. without witnesses or QUORUM observation) have a
neutral effect on reputation score (though such contracts will still appear in the contract history).
This reputation system can then be used by Clients to determine which Provider proposal to accept.

== QUORUM Gossip ==

In order to facilitate the formation of contracts, it is useful to have a regularly updated
notion of which systems have certain capabilities. For example, if a particular system is in need of
processing time, it needs to be able to quickly determine which providers can offer assistance. To
this end, QUORUM requires an inter-system communication network, and so we propose the
following gossiping system.

=== Gossiping in QUORUM ===

As QUORUM operates, there are four scenarios which involve gossiping: the entrance of a
node, the location of available services, the exchange of reputation information, and the detection of
network failure. We will address each of these separately. Various gossiping algorithms exist [4],
any of which are sufficient for the implementation of QUORUM. Given our current system, we have
derived a gossiping method that utilizes some components of existing methods, such as the
framework in [3].

Each node will have a list L of other nodes, in this list there exists the identity and known
services provided by that node. A node will update entries in L based upon the information it
receives through gossip messages.

=== Entrance ===

When a node joins a network, it will have the address of two existing nodes on the network.
After it joins the network, the two nodes will then randomly assign neighbors to the new node,
selecting them from the QUORUM via some algorithm. Depending on the size of the QUORUM, each
node will have a fixed h number of neighbors; the number of neighbors and how randomly these
neighbors are chosen is an area which needs to be investigated further.

The joining node also sends to its neighbors a list of services that it can provide. Once the
neighbouring nodes have received identified and received the services of the incoming node, they
can then use a gossiping algorithm of the QUORUM to propagate this information. In turn, the
neighbors send their lists to the joining node, making it aware of the various capabilities of the
network it has joined.

=== Search For Services ===

As we saw in the related work, it is highly unlikely (depending on the size of the QUORUM)
for a node to have knowledge of every other node in the system. A given node that is looking for a
particular service x, will first look up in his own list L (that has been updated via gossiping
messages), and if he cannot find service x with the required reputation, he will then ask around the
QUORUM to find the service he is looking for.

=== Failure Detection ===

Failure detection in distributed systems is a prominent sub-problem of distributed systems,
and has approached by several research groups (see [5-9], as referenced in [5]). Below we present
an aggregation of the above efforts, applied to the QUORUM.

Given the neighbor system of QUORUM, each node can expect a gossip message from each of its
neighbors within a period of t seconds. If a neighbor of a node r fails to send such a message,
then r will send a heartbeat message to that neighbor (i.e. the pull model [7,8]) asking whether
it is alive. If r then receives a response, it knows that its neighbor is alive. If r does not
receive a response (after enough heartbeat messages to be convinced that the neighbor is dead),
then it removes that node from its list of neighbors, looks for another neighbor, and
communicates the node failure (via gossip messages) to the other members of the QUORUM. Given
that this method relies heavily upon the neighbors of a node, it might be prudent to implement
an additional failure detection method, such as that found in the work of Hayashibara et al. [9].
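
The timeout-then-probe behavior described above could look roughly like the following Python sketch. The class name `NeighborMonitor`, the `max_probes` parameter and the `probe` callback are hypothetical; the report only says that "enough" heartbeat messages are sent before a neighbor is declared dead.

```python
class NeighborMonitor:
    """Tracks gossip arrivals and probes silent neighbors (pull model)."""

    def __init__(self, gossip_period, max_probes):
        self.gossip_period = gossip_period  # t seconds between gossips
        self.max_probes = max_probes        # assumed probe limit
        self.last_heard = {}                # neighbor id -> timestamp
        self.missed = {}                    # neighbor id -> failed probes

    def on_gossip(self, neighbor, now):
        """Called whenever a gossip message or probe reply arrives."""
        self.last_heard[neighbor] = now
        self.missed[neighbor] = 0

    def check(self, now, probe):
        """Probe overdue neighbors; return those declared failed.

        probe: callable(neighbor id) -> bool, True if the neighbor
               answered the heartbeat.
        """
        failed = []
        for neighbor, seen in list(self.last_heard.items()):
            if now - seen <= self.gossip_period:
                continue  # gossip arrived on time
            if probe(neighbor):
                self.on_gossip(neighbor, now)
                continue
            self.missed[neighbor] += 1
            if self.missed[neighbor] >= self.max_probes:
                failed.append(neighbor)
                del self.last_heard[neighbor]
                del self.missed[neighbor]
        return failed
```

The caller would then replace each failed neighbor and gossip the failure to the rest of the QUORUM; both steps are outside this sketch.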

= Future Work And Conclusion =

Moving forward with QUORUM, one main objective would be to attempt an actual
implementation of the system according to our specifications. In our research we made several
assumptions, one of which was that computers are not mobile. Further research could examine
how to modify QUORUM to allow for system mobility.

= References =

[1] D. C. Verma, Service Level Agreements on IP Networks, Proceedings of the IEEE, vol. 92,
pp. 1382-1388, September 2004.

[2] P. Groth, S. Miles, S. Modgil, N. Oren, M. Luck, Y. Gil, Determining the Trustworthiness of
New Electronic Contracts, Engineering Societies in the Agents World X, 2009.

[3] A. Kermarrec, M. van Steen, Gossiping in Distributed Systems, SIGOPS Oper. Syst. Rev., 41(5):2-7,
2007.

[4] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. “Randomized Gossip Algorithms.” IEEE
Transactions on Information Theory, 52(6):2508-2530, June 2006

[5] S. Sajjadpour, Failure Detection in Distributed Systems, Distributed Operating Systems course,
Carleton University, Winter 2011

[6] R. Van Renesse, Y. Minsky, M. Hayden, A gossip-style failure detection service. In Proceedings of
the IFIP International Conference on Distributed Systems Platforms and Open Distributed
Processing, 1998. The version used here is from 2007.

[7] P. Felber, X. Defago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In
Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The
version used here is from 2002.

[8] N. Hayashibara, A. Cherif, T. Katayama, Failure Detectors for Large-Scale Distributed Systems, In
21st IEEE Symposium of Reliable Distributed Systems (SRDS’ 02). 2002

[9] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings
of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’ 04),
pages 66-75, 2004

Category:2011-O&C

2011-04-13T00:45:01Z

Hadi sajjadpour:

Category:2011-O&C

2011-04-13T00:44:14Z

Hadi sajjadpour:

Category:2011-O&C

2011-04-13T00:42:54Z

Hadi sajjadpour:

DistOS-2011W Contracts and Observability Old First Page

2011-04-13T00:34:50Z

Hadi sajjadpour: Created page with "==****Changes to be viewed (delete once acknowledged by group)****== (TK) - I have provided a summary of key concepts that will help with the idea of resource allocation across t…"

==****Changes to be viewed (delete once acknowledged by group)****==
(TK) - I have provided a summary of key concepts that will help with the idea of resource allocation across the network under the summary for '''Heuristics for Enforcing Service Level Agreements in a Public Computing Utility''' ''(<--Scott can you link this directly to the summary...I have no idea how to)'' We can go more in depth with the concepts that catch your eyes. The paper is beautifully written and easy to understand

==Problem Outline==

* How do we define 'public' action? How do we monitor 'public' action without monitoring every action?
* How can you make sure your agent is acting according to your instructions?
* How can we ensure that information we receive through a third-party is legitimate?
* How do I observe the acts of other agents, particularly public acts?
* What '''CAN''' be observed?
* How can contracts be made between computers/agents?
* How can we ensure that contracts are being upheld?
* What side effects does observation have? For example, if everyone can see who buys something online, would that encourage or discourage use of such a website?

==Report Outline==

[https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B0GW1IZG3sfgZjI3YmM4NDAtNWE0Ny00YWUyLTgxMTctZWRiYzA3YjdmMzk4&hl=en Final Group Project Report]

[https://docs.google.com/present/view?id=dc5r38d8_2hfmsggfx&interval=60 Group presentation]

*Abstract
*Introduction
**Observability on a Network
*Automatic Contracts (System-to-System)
**What Can be Contracted?
**Determining When to Initiate a Contract
**States of a Contract
*Quantifiable Uniform Observation and Reporting of Unmanned Mediation (QUORUM)
**System Overview
***Roles in the QUORUM
***Gossip and Reputation
****QUORUM Cliques
***Validating a Contract, or How I Learned to Stop Worrying and Love the QUORUM
****Private Contracts
*Alternatives/Other Approaches to QUORUM
*The Future of QUORUM
*Conclusion

==Focus==

As we've discussed these topics we've decided that the focus of our report will be on *Contracts* and the Observation of their fulfillment. We are also under the assumption that participants are uniquely and universally identifiable.

==Members==
* Seyyed Hadi Sajjadpour
* Tarjit Komal
* Scott Lyons
* Andrew Luczak

Category:2011-O&C

2011-04-13T00:34:42Z

Hadi sajjadpour: Replaced content with "Please note that the majority of our efforts are contained on the "Discussion" page. * Old First Page"

Please note that the majority of our efforts are contained on the "Discussion" page.
* [[DistOS-2011W Contracts and Observability Old First Page|Old First Page]]

Category:2011-O&C

2011-04-11T21:19:37Z

Hadi sajjadpour: /* Report Outline */

Please note that the majority of our efforts are contained on the "Discussion" page.


Category:2011-O&C

2011-04-11T21:19:27Z

Hadi sajjadpour: /* Report Outline */

Please note that the majority of our efforts are contained on the "Discussion" page.


Category talk:2011-O&C

2011-04-04T18:06:27Z

Hadi sajjadpour: /* Summary */

==Papers==

===Observability===

* How do we define 'public' action? How do we monitor 'public' action without monitoring every action?
* How can you make sure your agent is acting according to your instructions?
* How can we ensure that information we receive through a third-party is legitimate?
* What '''CAN''' be observed?

==== Contract Monitoring ====

[http://dx.doi.org/10.1007/978-3-642-03668-2_29 Contract Monitoring in Agent-Based Systems: Case Study] from Lecture Notes in Computer Science by Jiří Hodík, Jiří Vokřínek and Michal Jakob, 2009

===== Abstract =====

Monitoring of fulfilment of obligations defined by electronic contracts in distributed domains is presented in this paper. A two-level model of contract-based systems and the types of observations needed for contract monitoring are introduced. The observations (inter-agent communication and agents’ actions) are collected and processed by the contract observation and analysis pipeline. The presented approach has been utilized in a multi-agent system for electronic contracting in a modular certification testing domain.

===== Summary =====
Andrew

==== Monitoring Service Contracts ====

[http://dx.doi.org/10.1007/3-540-45705-4_15 An Agent-Based Framework for Monitoring Service Contracts] from Lecture Notes in Computer Science by Helmut Kneer, Henrik Stormer, Harald Häuschen and Burkhard Stiller, 2002

===== Abstract =====

Within the past few years, the variety of real-time multimedia streaming services on the Internet has grown steadily. Performance of streaming services is very sensitive to traffic congestion and results very often in poor service quality on today’s best effort Internet. Reasons include the lack of any traffic prioritization mechanisms on the network level and its dependence on the cooperation of several Internet Service Providers and their reliable transmission of data packets. Therefore, service differentiation and its reliable delivery must be enforced on a business level through the introduction of service contracts between service providers and their customers. However, compliance with such service contracts is the crucial point that decides about successful improvement of the service delivery process. For that reason, an agent-based monitoring framework has been developed and introduced enabling the use of mobile agents to monitor compliance with contractual agreements between service providers and service customers. This framework describes the setup and the functionality of different kinds of mobile agents that allow monitoring of service contracts across domains of multiple service providers.

===== Summary =====
Andrew

===Contracts===

* What can or can't be contracted?
* How can you quantify abstract resources?
* How can two or more parties agree with a minimum of intervention?

Some forms of contracts exist in the form of Service Level Agreements, and there have been efforts made to automate this process:

==== AURIC ====
[http://dx.doi.org/10.1007/978-3-540-75694-1_21 AURIC: A Scalable and Highly Reusable SLA Compliance Auditing Framework] from Lecture Notes in Computer Science, by Hasan and Burkhard Stiller, 2007.

===== Abstract =====
Service Level Agreements (SLA) are needed to allow business interactions to rely on Internet services. Service Level Objectives (SLO) specify the committed performance level of a service. Thus, SLA compliance auditing aims at verifying these commitments. Since SLOs for various application services and end-to-end performance definitions vary largely, automated auditing of SLA compliances poses the challenge to an auditing framework. Moreover, end-to-end performance data are potentially large for a provider with many customers. Therefore, this paper presents a scalable and highly reusable auditing framework and a prototype, termed AURIC (Auditing Framework for Internet Services), whose components can be distributed across different domains.

===== Summary =====
TJ

==== Bandwidth ====
[http://dx.doi.org/10.1007/978-3-540-30189-9_19 SLA-Driven Flexible Bandwidth Reservation Negotiation Schemes for QoS Aware IP Networks] from Lecture Notes in Computer Science by Gerard Parr and Alan Marshall, 2004.

===== Abstract =====
We present a generic Service Level Agreement (SLA)-driven service provisioning architecture, which enables dynamic and flexible bandwidth reservation schemes on a per-user or a per-application basis. Various session level SLA negotiation schemes involving bandwidth allocation, service start time and service duration parameters are introduced and analysed. The results show that these negotiation schemes can be utilised for the benefits of both end user and network provider such as getting the highest individual SLA optimisation in terms of Quality of Service (QoS) and price. A prototype based on an industrial agent platform has also been built to demonstrate the negotiation scenario and this is presented and discussed.

===== Summary =====
Claimed by Scott

==== Dynamic Adaptation ====
[http://dx.doi.org/10.1007/978-3-540-89652-4_28 Context-Driven Autonomic Adaptation of SLA] from Lecture Notes in Computer Science by Caroline Herssens, Stéphane Faulkner and Ivan Jureta, 2008.

===== Abstract =====
Service Level Agreements (SLAs) are used in Service-Oriented Computing to define the obligations of the parties involved in a transaction. SLAs define the service users’ Quality of Service (QoS) requirements that the service provider should satisfy. Requirements defined once may not be satisfiable when the context of the web services changes (e.g., when requirements or resource availability changes). Changes in the context can make SLAs obsolete, making SLA revision necessary. We propose a method to autonomously monitor the services’ context, and adapt SLAs to avoid obsolescence thereof.

===== Summary =====
TJ

==== Heuristics for Enforcing Service Level Agreements ====
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.8674&rep=rep1&type=pdf Heuristics for Enforcing Service Level Agreements in a Public Computing Utility] A master's thesis by Balasubramaneyam Maniymaran.

===== Abstract =====
With the increasing popularity of consumer and research oriented wide-area applications, there arises a need for a robust and efficient wide-area resource management system. Even though there exists number of systems for wide area resource management, they fail to couple the QoS management with cost management, which is the key issue in pushing such a system to be commercially successful. Further, the lack of IT skills within the companies arouses the need of decoupling service management from the underlying complex wide-area resource management. A public computing utility (PCU) addresses both these issues, and, in addition, it creates a market place for the selling idling computing resources.

This work proposes a PCU model addressing the above mentioned issues and develops heuristics to enforce QoS in that model. A new concept called virtual clusters (VCs) is introduced as semi-dynamic, service specific resource partitions of a PCU, optimizing cost, QoS, and resource utilization. This thesis describes the methodology of VC creation, analyses the formulation of a VC creation into an optimization problem, and develops solution heuristics. The concept of VC is supported by two other concepts introduced here namely anchor point (AP) and overload partition (OLP). The concept of AP is used to represent the demand distribution in a network that assists the problem formulation of the VC creation and SLA management. The concept of overload partition is used to handle the demand spikes in a VC.

In a PCU, the VC management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the APs among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

===== Summary =====
One key concept that we should take from this paper is how they decide to allocate resources. Here is a brief but excellent point to consider:
*In a public computing utility (PCU), the virtual cluster (VC) management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the anchor points (AP) among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

'''''KEY CONCEPTS'''''

''The key features of the PCU Model are:''
*an ISP like service structure
*proposing the resource profiling scheme for resource registration
*addressing scalability by developing PCU structure made up of domains
*incorporating peering technology for inter-domain information dissemination
*SLA based service instantiation and monitoring

''The key concepts of the VCs idea in this paper are:''
*it mathematically formulates the trade-off between achieving the best QoS and reducing the system cost, making it best suitable for commercial infrastructures
*even though multiple services can occupy a single resource and the service–resource attachments can change with time, a virtualized static logical resource set exposed to the service origin (SO) hides the complexity
*being a semi-dynamic scheme, a VC can reshape itself matching the varying demand pattern, at the same time the static virtualization to the SO simplifying the service management
*the optimization based VC creation results in better resource utilization

''The key concept of anchor points:''
*By providing a representation of demand distribution in a network, the concept of anchor point enables a client-centric resource allocation for wide-area services.

''The key attributes of Overload Partitions:''
*they are selected via an optimization process and they are shared among multiple services
*they provide a cost-effective yet QoS-compliant solution for handling demand spikes in the network

==== Service Level Agreement in Cloud Computing ====
[http://knoesis.wright.edu/library/download/OOPSLA_cloud_wsla_v3.pdf SLAs in Cloud Computing] A paper written by Pankesh Patel, Ajith Ranabahu, Amit Sheth.

===== Abstract =====
Cloud computing that provides cheap and pay-as-you-go computing resources is rapidly gaining momentum as an alternative to traditional IT Infrastructure. As more and more consumers delegate their tasks to cloud providers, Service Level Agreements (SLA) between consumers and providers emerge as a key aspect. Due to the dynamic nature of the cloud, continuous monitoring on Quality of Service (QoS) attributes is necessary to enforce SLAs. Also numerous other factors such as trust (on the cloud provider) come into consideration, particularly for enterprise customers that may outsource its critical data. This complex nature of the cloud landscape warrants a sophisticated means of managing SLAs. This paper proposes a mechanism for managing SLAs in a cloud computing environment using the Web Service Level Agreement (WSLA) framework, developed for SLA monitoring and SLA enforcement in a Service Oriented Architecture (SOA). We use the third party support feature of WSLA to delegate monitoring and enforcement tasks to other entities in order to solve the trust issues. We also present a real world use case to validate our proposal.

===== Summary =====
Claimed by Scott

==== Service Level Agreements on IP Networks ====

By Dinesh C. Verma, IBM T. J Watson Research center
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1323286&tag=1

===== Abstract =====
This paper provides an overview of service-level agreements in IP networks. It looks at the typical components of a service-level agreement, and identifies three common approaches that are used to satisfy service level agreements in IP networks. The implications of using the approaches in the context of a network service provider, a hosting service provider, and an enterprise are examined. While most providers currently offer a static insurance approach towards supporting service level agreements, the schemes that can lead to more dynamic approaches are identified.

===== Summary =====

(HS) This paper starts off by talking about different components of a service level agreement. These components include:
1) A description of the nature of service to be provided
2) The expected performance level of the service, specifically its reliability and responsiveness
3) The time-frame for response and problem resolution
4) The process for monitoring and reporting the service level
5) The consequences for the service provider not meeting its obligations
6) Escape clauses and constraints.

Then they give three examples of Service level agreements on IP Networks:
1) Network Connectivity Services
2) Hosting Services
3) Integrated services

For each of the above three, they suggest some availability, performance and reliability clauses. I think these three notions (availability, reliability and performance) could be three parameters that each contract in the scheme we are designing should have.

After this they discuss three different approaches to support SLAs
1) Insurance Approach
2) Provisioning Approach
3) Adaptive Approach

==== Trustworthiness of New Contracts ====

[http://dx.doi.org/10.1007/978-3-642-10203-5_12 Determining the Trustworthiness of New Electronic Contracts] from Lecture Notes in Computer Science by Paul Groth, Simon Miles, Sanjay Modgil, Nir Oren, Michael Luck and Yolanda Gil, 2009.

===== Abstract =====

Expressing contractual agreements electronically potentially allows agents to automatically perform functions surrounding contract use: establishment, fulfilment, renegotiation etc. For such automation to be used for real business concerns, there needs to be a high level of trust in the agent-based system. While there has been much research on simulating trust between agents, there are areas where such trust is harder to establish. In particular, contract proposals may come from parties that an agent has had no prior interaction with and, in competitive business-to-business environments, little reputation information may be available. In human practice, trust in a proposed contract is determined in part from the content of the proposal itself, and the similarity of the content to that of prior contracts, executed to varying degrees of success. In this paper, we argue that such analysis is also appropriate in automated systems, and to provide it we need systems to record salient details of prior contract use and algorithms for assessing proposals on their content. We use provenance technology to provide the former and detail algorithms for measuring contract success and similarity for the latter, applying them to an aerospace case study.

===== Summary =====

Imagine this as a table:

*Type
**Whether this is an obligation or permission. A prohibition is modelled as an obligation not to do something, i.e. with a negative normative condition below
*Target
**The contract party obliged, prohibited or permitted by the clause.
*Activating Condition
**The circumstances under which the clause has force, parameterized by the variables specific to each instance.
*Normative Condition
**The circumstances under which the obligation is not being violated or the permission is being taken advantage of, parameterized by the variable specific to each instance. Therefore, for an obligation, the target must maintain the normative condition so as not to be in violation of the contract.
*Expiration Condition
**The circumstances under which the clause no longer has force, parameterized by the variables specific to each instance.

This paper provides a nice, straightforward definition of what a contract is, and provides the above schema for a contract clause.
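
As a rough illustration, the clause schema above could be represented as a small Python data structure. The class and field names, and the choice to model each condition as a predicate over a state dictionary, are assumptions made for this sketch and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict

State = Dict[str, object]           # hypothetical snapshot of the world
Condition = Callable[[State], bool]


class ClauseType(Enum):
    OBLIGATION = "obligation"
    PERMISSION = "permission"


@dataclass
class Clause:
    type: ClauseType
    target: str            # contract party bound by the clause
    activating: Condition  # when the clause starts to have force
    normative: Condition   # what must hold while it has force
    expiration: Condition  # when the clause stops having force

    def in_force(self, state: State) -> bool:
        return self.activating(state) and not self.expiration(state)

    def violated(self, state: State) -> bool:
        # Only an obligation in force can be violated; a prohibition is
        # an obligation whose normative condition is a negation.
        return (
            self.type is ClauseType.OBLIGATION
            and self.in_force(state)
            and not self.normative(state)
        )
```

For example, "Company A must cache data on day 1" becomes an obligation that activates when `day >= 1`, expires when `day > 1`, and whose normative condition is that caching is taking place.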

==== Web Privacy with P3P ====

http://www.oreilly.de/catalog/webprivp3p/

This book talks about P3P and how companies and web developers can comply with P3P.
Also check http://www.w3.org/P3P/

===== Summary =====

Hadi

Basically, users rely on a P3P-aware browser or browser extension. Websites publish their privacy policies in a machine-readable XML format, while users fill out a preference form in their browser.
When a user visits a website that publishes such a policy, the browser reads it and informs the user how well, and in which areas, the privacy policy of that website matches the preferences given by the user.
The language used to express and exchange these user preferences is called APPEL.
P3P was developed by the World Wide Web Consortium (W3C) and officially recommended on April 16, 2002. (from Wikipedia, http://en.wikipedia.org/wiki/P3p)

This is therefore not related to what we want to do, given that we have narrowed observability to observability of contracts.

==== Gossiping in Distributed Systems ====
[1] Paper: Gossiping in Distributed Systems, Anne-Marie Kermarrec, Maarten van Steen, ACM Sigops 2007

===== Abstract =====

Gossip-based algorithms were first introduced for reliably disseminating data in large-scale distributed systems. However, their simplicity, robustness, and flexibility make them attractive for more than just pure data dissemination alone. In particular, gossiping has been applied to data aggregation, overlay maintenance, and resource allocation. Gossiping applications more or less fit the same framework, with often subtle differences in algorithmic details determining divergent emergent behaviour. This divergence is often difficult to understand, as formal methods have yet to be developed that can capture the full design space of gossiping solutions. In this paper, we present a brief introduction to the field of gossiping in distributed systems, by providing a simple framework and using that framework to describe solutions for various application domains.

===== Summary =====
Hadi

This paper talks about different applications of gossiping and different approaches. Among the most important applications are data dissemination and monitoring services (such as failure detection). Nearly all gossip-style algorithms follow this framework:

a) Peer Selection: Selecting a list of peers to send some data to, either uniformly at random or based on some ranking criteria (e.g. proximity, need, etc.).

b) Data Exchanged: Selecting the information to pass on to the selected peers.

c) Data Processing: Processing the data received.

"Each peer is equipped with a cache, consisting of references to other peers in the system." This cache could also store information about other peers. In our project, this could be the service every other peer provides and the reputation that the peers have.

The paper is then divided into a few categories:

1) Dissemination

2) Peer Sampling

3) Topology Construction

4) Resource Management

In each, they update the framework that is mentioned above.

1) Data Dissemination: "Traditionally, gossip-based solutions have been used for data dissemination purposes. A standard approach toward dissemination is to simply let peers forward messages to each other [1]". The framework for this section is as follows:

Peer Selection: Each peer selects a list of peers to send information to.

Data Exchanged: Some message is selected and sent.

Data Processing: Receiving peer processes the data [1]
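
One round of this dissemination framework can be sketched in Python as below. The function names, the versioned cache entries, and the "newest version wins" merge rule are assumptions chosen to keep the example small; the paper describes the three steps only in general terms.

```python
import random


def select_peers(peer, all_peers, fanout, rng):
    """Peer selection: pick up to `fanout` other peers uniformly at random."""
    others = [p for p in all_peers if p != peer]
    return rng.sample(others, min(fanout, len(others)))


def merge(cache, incoming):
    """Data processing: keep the entry with the highest version per key."""
    for key, (version, value) in incoming.items():
        if key not in cache or cache[key][0] < version:
            cache[key] = (version, value)


def gossip_round(caches, fanout, rng=random):
    """Run one round: every peer pushes its cache to `fanout` peers.

    caches maps peer id -> {key: (version, value)}.
    """
    peers = sorted(caches)
    for peer in peers:
        payload = dict(caches[peer])  # data exchanged: a cache snapshot
        for target in select_peers(peer, peers, fanout, rng):
            merge(caches[target], payload)
```

In our project the cache values could hold the services and reputation of other peers, so repeated rounds would spread that information through the network.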

2) Peer Sampling:

Peer sampling assumes that the cost and latency of contacting each peer are the same. Realistically, however, this is not the case. In our system, we need to take into account cost, number of paths, etc., since not every peer is located close to a given peer.

3) Topology Construction:

Here they mention that each node/peer only maintains a partial view of the entire system for practical reasons.

4) Resource Management

In this section, they mention the other use of gossiping, which is for resource management and monitoring such as failure detection. In this application, messages exchanged are about status information, such as "Are you alive?" or "I am alive" messages. These messages could be in the form of heartbeats.

Gossiping can also be used for resource allocation. They give an example of "a gossip-based approach to estimate which slice of a collection a node belongs has been proposed in."

==Increasing Observability==

Like we discussed on Thursday, the real question when looking at observability is whether an action can be viewed, and who can view it. In the real world, you have a chance of being observed no matter what you do; the Internet, on the other hand, reduces this observability and instead offers a modicum of anonymity.

As the possibility of being observed increases, behavior adjusts to encourage the positive reputation of the actor or to conform with laws and regulations. This is the main benefit we wish to obtain by increasing the observability of digital actions. While omnipresent observation is possible on a computer network, in terms of observing contracts it might be more efficient to impose the possibility of being observed.

===A Possible System for Increasing Observability of Contracts and Actions?===

In class on Thursday, Scott brought up the idea of tracking a contract by making a minimal set of details available to all (i.e., everyone knows the parties involved in the contract, and whether the contract was fulfilled). Taking this a little further, our group considered the existence of an anonymous, distributed quorum of observers.

This quorum would, upon the creation of a contract, be given a summary of the contract (for example, Company A has agreed to cache data for Company B on a given day, while Company B will reciprocate the following day). Over the term of the contract, the individual systems in the quorum would test the contract to see if the terms had been met. At the end of the contract period, the systems would provide a "vote" declaring whether they witnessed the contract being fulfilled.

This system could also be extended to monitor general actions. Consider again this set of observers, but now they connect at random to various websites and take a snapshot of all connections to each one. At any given time, no other user knows which system the observers will be monitoring. In other words, the observers are analogous to police patrols, albeit with no set patrol route.

Category talk:2011-O&C

2011-04-04T18:05:33Z

Hadi sajjadpour: /* Summary */

==Papers==

===Observability===

* How do we define 'public' action? How do we monitor 'public' action without monitoring every action?
* How can you make sure your agent is acting according to your instructions?
* How can we ensure that information we receive through a third-party is legitimate?
* What '''CAN''' be observed?

==== Contract Monitoring ====

[http://dx.doi.org/10.1007/978-3-642-03668-2_29 Contract Monitoring in Agent-Based Systems: Case Study] from Lecture Notes in Computer Science by Jiří Hodík, Jiří Vokřínek and Michal Jakob, 2009

===== Abstract =====

Monitoring of fulfilment of obligations defined by electronic contracts in distributed domains is presented in this paper. A two-level model of contract-based systems and the types of observations needed for contract monitoring are introduced. The observations (inter-agent communication and agents’ actions) are collected and processed by the contract observation and analysis pipeline. The presented approach has been utilized in a multi-agent system for electronic contracting in a modular certification testing domain.

===== Summary =====
Andrew

==== Monitoring Service Contracts ====

[http://dx.doi.org/10.1007/3-540-45705-4_15 An Agent-Based Framework for Monitoring Service Contracts] from Lecture Notes in Computer Science by Helmut Kneer, Henrik Stormer, Harald Häuschen and Burkhard Stiller, 2002

===== Abstract =====

Within the past few years, the variety of real-time multimedia streaming services on the Internet has grown steadily. Performance of streaming services is very sensitive to traffic congestion and results very often in poor service quality on today’s best effort Internet. Reasons include the lack of any traffic prioritization mechanisms on the network level and its dependence on the cooperation of several Internet Service Providers and their reliable transmission of data packets. Therefore, service differentiation and its reliable delivery must be enforced on a business level through the introduction of service contracts between service providers and their customers. However, compliance with such service contracts is the crucial point that decides about successful improvement of the service delivery process. For that reason, an agent-based monitoring framework has been developed and introduced enabling the use of mobile agents to monitor compliance with contractual agreements between service providers and service customers. This framework describes the setup and the functionality of different kinds of mobile agents that allow monitoring of service contracts across domains of multiple service providers.

===== Summary =====
Andrew

===Contracts===

* What can or can't be contracted?
* How can you quantify abstract resources?
* How can two or more parties agree with a minimum of intervention?

Some forms of contracts exist in the form of Service Level Agreements, and there have been efforts made to automate this process:

==== AURIC ====
[http://dx.doi.org/10.1007/978-3-540-75694-1_21 AURIC: A Scalable and Highly Reusable SLA Compliance Auditing Framework] from Lecture Notes in Computer Science, by Hasan and Burkhard Stiller, 2007.

===== Abstract =====
Service Level Agreements (SLA) are needed to allow business interactions to rely on Internet services. Service Level Objectives (SLO) specify the committed performance level of a service. Thus, SLA compliance auditing aims at verifying these commitments. Since SLOs for various application services and end-to-end performance definitions vary largely, automated auditing of SLA compliances poses the challenge to an auditing framework. Moreover, end-to-end performance data are potentially large for a provider with many customers. Therefore, this paper presents a scalable and highly reusable auditing framework and a prototype, termed AURIC (Auditing Framework for Internet Services), whose components can be distributed across different domains.

===== Summary =====
TJ

==== Bandwidth ====
[http://dx.doi.org/10.1007/978-3-540-30189-9_19 SLA-Driven Flexible Bandwidth Reservation Negotiation Schemes for QoS Aware IP Networks] from Lecture Notes in Computer Science by Gerard Parr and Alan Marshall, 2004.

===== Abstract =====
We present a generic Service Level Agreement (SLA)-driven service provisioning architecture, which enables dynamic and flexible bandwidth reservation schemes on a per-user or a per-application basis. Various session level SLA negotiation schemes involving bandwidth allocation, service start time and service duration parameters are introduced and analysed. The results show that these negotiation schemes can be utilised for the benefits of both end user and network provider such as getting the highest individual SLA optimisation in terms of Quality of Service (QoS) and price. A prototype based on an industrial agent platform has also been built to demonstrate the negotiation scenario and this is presented and discussed.

===== Summary =====
Claimed by Scott

==== Dynamic Adaptation ====
[http://dx.doi.org/10.1007/978-3-540-89652-4_28 Context-Driven Autonomic Adaptation of SLA] from Lecture notes in Computer Science, by authors Caroline Herssens, Stéphane Faulkner and Ivan Jureta, 2008.

===== Abstract =====
Service Level Agreements (SLAs) are used in Service-Oriented Computing to define the obligations of the parties involved in a transaction. SLAs define the service users’ Quality of Service (QoS) requirements that the service provider should satisfy. Requirements defined once may not be satisfiable when the context of the web services changes (e.g., when requirements or resource availability changes). Changes in the context can make SLAs obsolete, making SLA revision necessary. We propose a method to autonomously monitor the services’ context, and adapt SLAs to avoid obsolescence thereof.

===== Summary =====
TJ

==== Heuristics for Enforcing Service Level Agreements ====
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.8674&rep=rep1&type=pdf Heuristics for Enforcing Service Level Agreements in a Public Computing Utility] A masters thesis paper by Balasubramaneyam Maniymaran.

===== Abstract =====
With the increasing popularity of consumer and research oriented wide-area applications, there arises a need for a robust and efﬁcient wide-area resource management system. Even though there exists number of systems for wide area resource management, they fail to couple the QoS management with cost management, which is the key issue in pushing such a system to be commercially successful. Further, the lack of IT skills within the companies arouses the need of decoupling service management from the underlying complex wide-area resource management. A public computing utility (PCU) addresses both these issues, and, in addition, it creates a market place for the selling idling computing resources.

This work proposes a PCU model addressing the above mentioned issues and develops heuristics to enforce QoS in that model. A new concept called virtual clusters (VCs) is introduced as semi-dynamic, service speciﬁc resource partitions of a PCU, optimizing cost, QoS, and resource utilization. This thesis describes the methodology of VC creation, analyses the formulation of a VC creation into an optimization problem, and develops solution heuristics. The concept of VC is supported by two other concepts introduced here namely anchor point (AP) and overload partition (OLP). The concept of AP is used to represent the demand distribution in a network that assists the problem formulation of the VC creation and SLA management. The concept of overload partition is used to handle the demand spikes in a VC.

In a PCU, the VC management is implemented in two phases: the ﬁrst is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the APs among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC conﬁgurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

===== Summary =====
One key concept that we should take from this paper is the way they decided how to allocate the resources. Here is a brief but excellent point to consider:
*In a public computing utility (PCU), the virtual cluster (VC) management is implemented in two phases: the ﬁrst is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the anchor points (AP) among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC conﬁgurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

'''''KEY CONCEPTS'''''

''The key features of the PCU Model are:''
*an ISP like service structure
*proposing the resource profiling scheme for resource registration
*addressing scalability by developing PCU structure made up of domains
*incorporating peering technology for inter-domain information dissemination
*SLA based service instantiation and monitoring

''The key concepts of the VCs idea in this paper are:''
*it mathematically formulates the trade-off between achieving the best QoS and reducing the system cost, making it best suitable for commercial infrastructures
*even though multiple services can occupy a single resource and the service–resource attachments can change with time, a virtualized static logical resource set exposed to the service origin (SO) hides the complexity
*being a semi-dynamic scheme, a VC can reshape itself matching the varying demand pattern, at the same time the static virtualization to the SO simplifying the service management
*the optimization based VC creation results in better resource utilization

''The key concept to anchor points:''
*By providing a representation of demand distribution in a network, the concept of anchor point enables a client-centric resource allocation for wide-area services.

''The key attributes of Overload Partitions:''
*they are selected via an optimization process and they are shared among multiple services.
*Provides a cost effective, but still QoS obeying solution to handle demand spikes in the network

==== Service Level Agreement in Cloud Computing ====
[http://knoesis.wright.edu/library/download/OOPSLA_cloud_wsla_v3.pdf SLAs in Cloud Computing] A paper written by Pankesh Patel, Ajith Ranabahu, Amit Sheth.

===== Abstract =====
Cloud computing that provides cheap and pay-as-you-go computing resources is rapidly gaining momentum as an alternative to traditional IT Infrastructure. As more and more consumers delegate their tasks to cloud providers, Service Level Agreements (SLA) between consumers and providers emerge as a key aspect. Due to the dynamic nature of the cloud, continuous monitoring on Quality of Service (QoS) attributes is necessary to enforce SLAs. Also numerous other factors such as trust (on the cloud provider) come into consideration, particularly for enterprise customers that may outsource its critical data. This complex nature of the cloud landscape warrants a sophisticated means of managing SLAs. This paper proposes a mechanism for managing SLAs in a cloud computing environment using the Web Service Level Agreement (WSLA) framework, developed for SLA monitoring and SLA enforcement in a Service Oriented Architecture (SOA). We use the third party support feature of WSLA to delegate monitoring and enforcement tasks to other entities in order to solve the trust issues. We also present a real world use case to validate our proposal.

===== Summary =====
Claimed by Scott

==== Service Level Agreements on IP Networks ====

By Dinesh C. Verma, IBM T. J. Watson Research Center
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1323286&tag=1

===== Abstract =====
Abstract: This paper provides an overview of service-level agreements in IP networks. It looks at the typical components of a service-level agreement, and identifies three common approaches that are used to satisfy service level agreements in IP networks. The implications of using the approaches in the context of a network service provider, a hosting service provider, and an enterprise are examined. While most providers currently offer a static insurance approach towards supporting service level agreements, the schemes that can lead to more dynamic approaches are identified.

===== Summary =====

(HS) This paper starts off by talking about different components of a service level agreement. These components include:
1) A description of the nature of service to be provided
2) The expected performance level of the service, specifically its reliability and responsiveness
3) The time-frame for response and problem resolution
4) The process for monitoring and reporting the service level
5) The consequences for the service provider not meeting its obligations
6) Escape clauses and constraints.

Then they give three examples of Service level agreements on IP Networks:
1) Network Connectivity Services
2) Hosting Services
3) Integrated services

For each of the above three, they suggest some availability, performance and reliability clauses. I think that these three notions of 'availability, reliability and performance' could be three parameters that each contract in the scheme we are designing should have.

After this they discuss three different approaches to support SLAs
1) Insurance Approach
2) Provisioning Approach
3) Adaptive Approach

==== Trustworthiness of New Contracts ====

[http://dx.doi.org/10.1007/978-3-642-10203-5_12 Determining the Trustworthiness of New Electronic Contracts] from Lecture Notes in Computer Science by Paul Groth, Simon Miles, Sanjay Modgil, Nir Oren, Michael Luck and Yolanda Gil, 2009.

===== Abstract =====

Expressing contractual agreements electronically potentially allows agents to automatically perform functions surrounding contract use: establishment, fulfilment, renegotiation etc. For such automation to be used for real business concerns, there needs to be a high level of trust in the agent-based system. While there has been much research on simulating trust between agents, there are areas where such trust is harder to establish. In particular, contract proposals may come from parties that an agent has had no prior interaction with and, in competitive business-to-business environments, little reputation information may be available. In human practice, trust in a proposed contract is determined in part from the content of the proposal itself, and the similarity of the content to that of prior contracts, executed to varying degrees of success. In this paper, we argue that such analysis is also appropriate in automated systems, and to provide it we need systems to record salient details of prior contract use and algorithms for assessing proposals on their content. We use provenance technology to provide the former and detail algorithms for measuring contract success and similarity for the latter, applying them to an aerospace case study.

===== Summary =====

Imagine this as a table:

*Type
**Whether this is an obligation or permission. A prohibition is modelled as an obligation not to do something, i.e. with a negative normative condition below
*Target
**The contract party obliged, prohibited or permitted by the clause.
*Activating Condition
**The circumstances under which the clause has force, parameterized by the variables specific to each instance.
*Normative Condition
**The circumstances under which the obligation is not being violated or the permission is being taken advantage of, parameterized by the variable specific to each instance. Therefore, for an obligation, the target must maintain the normative condition so as not to be in violation of the contract.
*Expiration Condition
**The circumstances under which the clause no longer has force, parameterized by the variables specific to each instance.

This paper provides a nice, straight-forward definition of what a contract is, and provides the above schema for a contract.
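
The clause schema above can be written down as a small data structure. The following is only a minimal sketch, not the paper's implementation: the class and field names, the use of plain predicates over an observed state, and the example clause (borrowed from our Company A / Company B caching scenario) are all our own assumptions. In particular, the activating condition is treated as a simple predicate on the current state rather than as an event that switches the clause on.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict

# Hypothetical snapshot of observed facts; the keys are assumptions for this sketch.
State = Dict[str, Any]


class ClauseType(Enum):
    OBLIGATION = "obligation"   # a prohibition is an obligation with a negative normative condition
    PERMISSION = "permission"


@dataclass
class Clause:
    type: ClauseType
    target: str                             # the party obliged or permitted
    activating: Callable[[State], bool]     # when the clause has force
    normative: Callable[[State], bool]      # what must hold while it has force
    expiration: Callable[[State], bool]     # when the clause stops having force

    def in_force(self, state: State) -> bool:
        return self.activating(state) and not self.expiration(state)

    def violated(self, state: State) -> bool:
        # Only obligations can be violated; a permission is merely used or unused.
        return (
            self.type is ClauseType.OBLIGATION
            and self.in_force(state)
            and not self.normative(state)
        )


# Example: Company A must cache data for Company B on day 1 only.
cache_clause = Clause(
    type=ClauseType.OBLIGATION,
    target="Company A",
    activating=lambda s: s["day"] >= 1,
    normative=lambda s: s["a_is_caching_for_b"],
    expiration=lambda s: s["day"] > 1,
)
```

An observer could evaluate <code>cache_clause.violated(state)</code> each time it samples the system, which is one way the monitoring side could consume this schema.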

==== Web Privacy with P3P ====

http://www.oreilly.de/catalog/webprivp3p/

This book talks about P3P and how companies and web developers can comply with P3P.
Also check http://www.w3.org/P3P/

===== Summary =====
This is about information observation, not what we want to do.

Basically, users install a plugin/extension in their browsers. This extension has an XML template that websites use to specify their privacy policies.
Users, in turn, fill out a preference form in their browser. When a user visits a website that provides this XML policy, the extension reads it and informs the user
how compatible the website's privacy policy is with the user's preferences, and in which areas it differs. The language used to express and exchange the user's preferences is called APPEL (A P3P Preference Exchange Language).
P3P was developed by the World Wide Web Consortium (W3C) and officially recommended on April 16, 2002. (from Wikipedia, http://en.wikipedia.org/wiki/P3p)

==== Gossiping in Distributed Systems ====
[1] Paper: Gossiping in Distributed Systems, Anne-Marie Kermarrec, Maarten van Steen, ACM Sigops 2007

===== Abstract =====

Gossip-based algorithms were first introduced for reliably disseminating data in large-scale distributed systems. However, their simplicity, robustness, and flexibility make them attractive for more than just pure data dissemination alone. In particular, gossiping has been applied to data aggregation, overlay maintenance, and resource allocation. Gossiping applications more or less fit the same framework, with often subtle differences in algorithmic details determining divergent emergent behaviour. This divergence is often difficult to understand, as formal methods have yet to be developed that can capture the full design space of gossiping solutions. In this paper, we present a brief introduction to the field of gossiping in distributed systems, by providing a simple framework and using that framework to describe solutions for various application domains.

===== Summary =====
Hadi

This paper talks about different applications of gossiping and different approaches. Among the most important applications are data dissemination and monitoring services (such as failure detection). Nearly all gossip-style algorithms follow this framework:

a) Peer selection: This is the process of selecting a list of peers, either uniformly at random, or based on some ranking criteria (e.g. proximity, need etc.) to send some data to.

b) Data Exchanged: Selecting the information to pass on to the selected peers.

c) Data Processing: Processing the data received.

"Each peer is equipped with a cache, consisting of references to other peers in the system." This cache could also store information about other peers. In our project, this could be the service every other peer provides and the reputation that the peers have.

The paper is then divided into a few categories:

1) Dissemination

2) Peer Sampling

3) Topology Construction

4) Resource Management

In each, they update the framework mentioned above.

1) Data Dissemination: "Traditionally, gossip-based solutions have been used for data dissemination purposes. A standard approach toward dissemination is to simply let peers forward messages to each other [1]". The framework for this section is as follows:

Peer Selection: Each peer selects a list of peers to send information to.

Data Exchanged: Some message is selected and sent.

Data Processing: Receiving peer processes the data [1]
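
To make the three steps concrete, here is a minimal push-gossip simulation. It is a sketch under our own assumptions, not code from the paper: peers are chosen uniformly at random, each informed peer forwards one known message per round to <code>fanout</code> targets, and "processing" simply means remembering the message. The function names and the <code>contract-summary</code> message are hypothetical.

```python
import random
from typing import Dict, Set


def gossip_round(knows: Dict[str, Set[str]], fanout: int, rng: random.Random) -> None:
    """Run one synchronous round of push gossip over all peers."""
    peers = sorted(knows)
    # Snapshot, so messages learned in this round are only forwarded in the next one.
    outgoing = {p: set(msgs) for p, msgs in knows.items()}
    for sender in peers:
        if not outgoing[sender]:
            continue  # nothing to forward yet
        others = [p for p in peers if p != sender]
        # Peer selection: uniformly at random (an assumption of this sketch).
        targets = rng.sample(others, min(fanout, len(others)))
        # Data exchanged: one message the sender already knows.
        message = rng.choice(sorted(outgoing[sender]))
        for target in targets:
            # Data processing: the receiver just stores the message.
            knows[target].add(message)


def rounds_until_all_informed(n_peers: int, fanout: int, seed: int = 0,
                              max_rounds: int = 1000) -> int:
    """Count rounds until every peer has heard the single seeded message."""
    rng = random.Random(seed)
    knows: Dict[str, Set[str]] = {f"peer{i}": set() for i in range(n_peers)}
    knows["peer0"].add("contract-summary")
    rounds = 0
    while any(not msgs for msgs in knows.values()) and rounds < max_rounds:
        gossip_round(knows, fanout, rng)
        rounds += 1
    return rounds
```

Calling <code>rounds_until_all_informed(20, 2)</code> gives a feel for how quickly a contract summary could spread through a small group of peers; the exact count depends on the seed.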

2) Peer Sampling:

Peer sampling assumes that the cost and latency of contacting each peer are the same. Realistically, this is not the case. In our system, we need to take into account factors such as cost and the number of paths, since not all peers are located next to a given peer.
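
One way to account for unequal contact costs is to bias peer selection toward cheaper peers. The sketch below is our own illustration and is not taken from the paper: it assumes each peer has a known positive cost (for example, a hop count) and samples peers without replacement with weight 1/cost. The function name and the cost values are hypothetical.

```python
import random
from typing import Dict, List


def select_peers(costs: Dict[str, float], k: int, rng: random.Random) -> List[str]:
    """Pick up to k distinct peers, favouring low-cost ones (weight = 1 / cost).

    Costs must be strictly positive.
    """
    remaining = dict(costs)
    chosen: List[str] = []
    while remaining and len(chosen) < k:
        names = list(remaining)
        weights = [1.0 / remaining[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen
```

Distant peers still get picked occasionally, which keeps information flowing across the whole system while most traffic stays local.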

3) Topology Construction:

Here they mention that each node/peer only maintains a partial view of the entire system for practical reasons.

4) Resource Management

In this section, they mention the other use of gossiping, which is for resource management and monitoring such as failure detection. In this application, messages exchanged are about status information, such as "Are you alive?" or "I am alive" messages. These messages could be in the form of heartbeats.
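
A heartbeat-style failure detector could look roughly like the following. This is a simplified sketch of the general idea, not the paper's algorithm: each peer keeps the latest time it heard about every other peer, merges tables received through gossip, and suspects any peer that has been silent for longer than a timeout. The class name, the timeout value and the use of plain numeric timestamps are assumptions.

```python
from typing import Dict, List


class HeartbeatMonitor:
    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def record(self, peer: str, now: float) -> None:
        """An "I am alive" message arrived directly from this peer."""
        self.last_seen[peer] = now

    def merge(self, gossiped: Dict[str, float]) -> None:
        """Merge another peer's table, keeping the freshest timestamp for each peer."""
        for peer, seen in gossiped.items():
            if seen > self.last_seen.get(peer, float("-inf")):
                self.last_seen[peer] = seen

    def suspected(self, now: float) -> List[str]:
        """Peers not heard from, directly or via gossip, within the timeout."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

Because tables are merged through gossip, a peer does not need to hear from every other peer directly to know that it is still alive.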

In resource management, it could be used in resource allocation. They give an example of "a gossip-based approach to estimate which slice of a collection a node belongs has been proposed in."

==Increasing Observability==

Like we discussed on Thursday, the real question when looking at observability is whether an action can be viewed, and who can view it. In the real world, you have a chance of being observed no matter what you do; the Internet, on the other hand, reduces this observability and instead offers a modicum of anonymity.

As the possibility of being observed increases, behavior adjusts to encourage the positive reputation of the actor or to conform with laws and regulations. This is the main benefit we wish to obtain by increasing the observability of digital actions. While omnipresent observation is possible on a computer network, in terms of observing contracts it might be more efficient to impose the possibility of being observed.

===A Possible System for Increasing Observability of Contracts and Actions?===

In class on Thursday, Scott brought up the idea of tracking a contract by making a minimal set of details available to all (i.e., everyone knows the parties involved in the contract, and whether the contract was fulfilled). Taking this a little further, our group considered the existence of an anonymous, distributed quorum of observers.

This quorum would, upon the creation of a contract, be given a summary of the contract (for example, Company A has agreed to cache data for Company B on a given day, while Company B will reciprocate the following day). Over the term of the contract, the individual systems in the quorum would test the contract to see if the terms had been met. At the end of the contract period, the systems would provide a "vote" declaring whether they witnessed the contract being fulfilled.
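
The voting step could be tallied along these lines. This is only a sketch of one possible rule, and every detail is our assumption: an observer may abstain if it never managed to test the contract, a verdict is only issued when more than half of the quorum actually voted, and the contract counts as fulfilled when the share of "fulfilled" votes exceeds a threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vote:
    observer: str
    fulfilled: Optional[bool]  # None means the observer could not test the contract


def tally(votes: List[Vote], quorum_size: int, threshold: float = 0.5) -> Optional[bool]:
    """Return True/False for fulfilled/not fulfilled, or None if too few observers voted."""
    cast = [v.fulfilled for v in votes if v.fulfilled is not None]
    if len(cast) * 2 <= quorum_size:
        return None  # no verdict without a majority of the quorum participating
    return sum(cast) / len(cast) > threshold
```

The result of <code>tally</code> is the kind of minimal public fact discussed above: only the parties and the fulfilled/not fulfilled outcome would need to be visible to everyone.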

This system could also be extended to monitor general actions. Consider again this set of observers, but now they connect at random to various websites and take a snapshot of all connections to each one. At any given time, no other user knows which system the observers will be monitoring. In other words, the observers are analogous to police patrols, albeit with no set patrol route.

Category talk:2011-O&C

2011-04-04T17:54:17Z

Hadi sajjadpour: /* Summary */

==Papers==

===Observability===

* How do we define 'public' action? How do we monitor 'public' action without monitoring every action?
* How can you make sure your agent is acting according to your instructions?
* How can we ensure that information we receive through a third-party is legitimate?
* What '''CAN''' be observed?

==== Contract Monitoring ====

[http://dx.doi.org/10.1007/978-3-642-03668-2_29 Contract Monitoring in Agent-Based Systems: Case Study] from Lecture Notes in Computer Science by Jiří Hodík, Jiří Vokřínek and Michal Jakob, 2009

===== Abstract =====

Monitoring of fulfilment of obligations defined by electronic contracts in distributed domains is presented in this paper. A two-level model of contract-based systems and the types of observations needed for contract monitoring are introduced. The observations (inter-agent communication and agents’ actions) are collected and processed by the contract observation and analysis pipeline. The presented approach has been utilized in a multi-agent system for electronic contracting in a modular certification testing domain.

===== Summary =====
Andrew

==== Monitoring Service Contracts ====

[http://dx.doi.org/10.1007/3-540-45705-4_15 An Agent-Based Framework for Monitoring Service Contracts] from Lecture Notes in Computer Science by Helmut Kneer, Henrik Stormer, Harald Häuschen and Burkhard Stiller, 2002

===== Abstract =====

Within the past few years, the variety of real-time multimedia streaming services on the Internet has grown steadily. Performance of streaming services is very sensitive to traffic congestion and results very often in poor service quality on today’s best effort Internet. Reasons include the lack of any traffic prioritization mechanisms on the network level and its dependence on the cooperation of several Internet Service Providers and their reliable transmission of data packets. Therefore, service differentiation and its reliable delivery must be enforced on a business level through the introduction of service contracts between service providers and their customers. However, compliance with such service contracts is the crucial point that decides about successful improvement of the service delivery process. For that reason, an agent-based monitoring framework has been developed and introduced enabling the use of mobile agents to monitor compliance with contractual agreements between service providers and service customers. This framework describes the setup and the functionality of different kinds of mobile agents that allow monitoring of service contracts across domains of multiple service providers.

===== Summary =====
Andrew

===Contracts===

* What can or can't be contracted?
* How can you quantify abstract resources?
* How can two or more parties agree with a minimum of intervention?

Some forms of contracts exist in the form of Service Level Agreements, and there have been efforts made to automate this process:

==== AURIC ====
[http://dx.doi.org/10.1007/978-3-540-75694-1_21 AURIC: A Scalable and Highly Reusable SLA Compliance Auditing Framework] from Lecture Notes in Computer Science, by Hasan and Burkhard Stiller, 2007.

===== Abstract =====
Service Level Agreements (SLA) are needed to allow business interactions to rely on Internet services. Service Level Objectives (SLO) specify the committed performance level of a service. Thus, SLA compliance auditing aims at verifying these commitments. Since SLOs for various application services and end-to-end performance definitions vary largely, automated auditing of SLA compliances poses the challenge to an auditing framework. Moreover, end-to-end performance data are potentially large for a provider with many customers. Therefore, this paper presents a scalable and highly reusable auditing framework and a prototype, termed AURIC (Auditing Framework for Internet Services), whose components can be distributed across different domains.

===== Summary =====
TJ

==== Bandwidth ====
[http://dx.doi.org/10.1007/978-3-540-30189-9_19 SLA-Driven Flexible Bandwidth Reservation Negotiation Schemes for QoS Aware IP Networks] from Lecture Notes in Computer Science by Gerard Parr and Alan Marshall, 2004.

===== Abstract =====
We present a generic Service Level Agreement (SLA)-driven service provisioning architecture, which enables dynamic and flexible bandwidth reservation schemes on a per-user or a per-application basis. Various session level SLA negotiation schemes involving bandwidth allocation, service start time and service duration parameters are introduced and analysed. The results show that these negotiation schemes can be utilised for the benefits of both end user and network provider such as getting the highest individual SLA optimisation in terms of Quality of Service (QoS) and price. A prototype based on an industrial agent platform has also been built to demonstrate the negotiation scenario and this is presented and discussed.

===== Summary =====
Claimed by Scott

==== Dynamic Adaptation ====
[http://dx.doi.org/10.1007/978-3-540-89652-4_28 Context-Driven Autonomic Adaptation of SLA] from Lecture notes in Computer Science, by authors Caroline Herssens, Stéphane Faulkner and Ivan Jureta, 2008.

===== Abstract =====
Service Level Agreements (SLAs) are used in Service-Oriented Computing to define the obligations of the parties involved in a transaction. SLAs define the service users’ Quality of Service (QoS) requirements that the service provider should satisfy. Requirements defined once may not be satisfiable when the context of the web services changes (e.g., when requirements or resource availability changes). Changes in the context can make SLAs obsolete, making SLA revision necessary. We propose a method to autonomously monitor the services’ context, and adapt SLAs to avoid obsolescence thereof.

===== Summary =====
TJ

==== Heuristics for Enforcing Service Level Agreements ====
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.8674&rep=rep1&type=pdf Heuristics for Enforcing Service Level Agreements in a Public Computing Utility] A masters thesis paper by Balasubramaneyam Maniymaran.

===== Abstract =====
With the increasing popularity of consumer and research oriented wide-area applications, there arises a need for a robust and efﬁcient wide-area resource management system. Even though there exists number of systems for wide area resource management, they fail to couple the QoS management with cost management, which is the key issue in pushing such a system to be commercially successful. Further, the lack of IT skills within the companies arouses the need of decoupling service management from the underlying complex wide-area resource management. A public computing utility (PCU) addresses both these issues, and, in addition, it creates a market place for the selling idling computing resources.

This work proposes a PCU model addressing the above mentioned issues and develops heuristics to enforce QoS in that model. A new concept called virtual clusters (VCs) is introduced as semi-dynamic, service speciﬁc resource partitions of a PCU, optimizing cost, QoS, and resource utilization. This thesis describes the methodology of VC creation, analyses the formulation of a VC creation into an optimization problem, and develops solution heuristics. The concept of VC is supported by two other concepts introduced here namely anchor point (AP) and overload partition (OLP). The concept of AP is used to represent the demand distribution in a network that assists the problem formulation of the VC creation and SLA management. The concept of overload partition is used to handle the demand spikes in a VC.

In a PCU, the VC management is implemented in two phases: the ﬁrst is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the APs among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC conﬁgurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

===== Summary =====
One key concept that we should take from this paper is how it decides to allocate resources. The following point from the abstract is worth considering:
*In a public computing utility (PCU), the virtual cluster (VC) management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the anchor points (AP) among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

'''''KEY CONCEPTS'''''

''The key features of the PCU Model are:''
*an ISP like service structure
*proposing the resource profiling scheme for resource registration
*addressing scalability by developing PCU structure made up of domains
*incorporating peering technology for inter-domain information dissemination
*SLA based service instantiation and monitoring

''The key concepts of the VC idea in this paper are:''
*it mathematically formulates the trade-off between achieving the best QoS and reducing the system cost, making it well suited to commercial infrastructures
*even though multiple services can occupy a single resource and the service–resource attachments can change with time, a virtualized static logical resource set exposed to the service origin (SO) hides the complexity
*being a semi-dynamic scheme, a VC can reshape itself to match the varying demand pattern, while the static virtualization presented to the SO simplifies service management
*the optimization-based VC creation results in better resource utilization

''The key concept of anchor points:''
*by providing a representation of demand distribution in a network, the concept of anchor point enables client-centric resource allocation for wide-area services

''The key attributes of overload partitions:''
*they are selected via an optimization process and are shared among multiple services
*they provide a cost-effective, but still QoS-obeying, solution to handle demand spikes in the network
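
To make the two-phase idea concrete for our own discussion, here is a minimal sketch. It is not the heuristic from the thesis: the greedy "cheapest capacity first" selection, the least-relative-load dispatch rule, and every name in the code (<code>Node</code>, <code>create_virtual_cluster</code>, <code>dispatch</code>) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: int   # requests the node can serve per time unit (assumed unit)
    cost: float     # cost of reserving the node (assumed unit)
    load: int = 0   # requests currently assigned to the node


def create_virtual_cluster(nodes, expected_demand):
    """Off-line phase: greedily reserve the cheapest capacity until demand is covered."""
    chosen, covered = [], 0
    for node in sorted(nodes, key=lambda n: n.cost / n.capacity):
        if covered >= expected_demand:
            break
        chosen.append(node)
        covered += node.capacity
    if covered < expected_demand:
        raise ValueError("not enough capacity for the expected demand")
    return chosen


def dispatch(cluster):
    """On-line phase: send one request to the node with the lowest relative load."""
    target = min(cluster, key=lambda n: n.load / n.capacity)
    target.load += 1
    return target


pool = [Node("a", 10, 5.0), Node("b", 10, 20.0), Node("c", 5, 2.0)]
vc = create_virtual_cluster(pool, expected_demand=12)
for _ in range(6):
    dispatch(vc)
```

The expensive node <code>b</code> is never reserved here because the two cheaper nodes already cover the expected demand; an overload partition would be the place to absorb demand beyond that.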

==== Service Level Agreement in Cloud Computing ====
[http://knoesis.wright.edu/library/download/OOPSLA_cloud_wsla_v3.pdf SLAs in Cloud Computing] A paper written by Pankesh Patel, Ajith Ranabahu, Amit Sheth.

===== Abstract =====
Cloud computing that provides cheap and pay-as-you-go computing resources is rapidly gaining momentum as an alternative to traditional IT Infrastructure. As more and more consumers delegate their tasks to cloud providers, Service Level Agreements (SLA) between consumers and providers emerge as a key aspect. Due to the dynamic nature of the cloud, continuous monitoring on Quality of Service (QoS) attributes is necessary to enforce SLAs. Also numerous other factors such as trust (on the cloud provider) come into consideration, particularly for enterprise customers that may outsource its critical data. This complex nature of the cloud landscape warrants a sophisticated means of managing SLAs. This paper proposes a mechanism for managing SLAs in a cloud computing environment using the Web Service Level Agreement (WSLA) framework, developed for SLA monitoring and SLA enforcement in a Service Oriented Architecture (SOA). We use the third party support feature of WSLA to delegate monitoring and enforcement tasks to other entities in order to solve the trust issues. We also present a real world use case to validate our proposal.

===== Summary =====
Claimed by Scott

==== Service Level Agreements on IP Networks ====

By Dinesh C. Verma, IBM T. J. Watson Research Center
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1323286&tag=1

===== Abstract =====
This paper provides an overview of service-level agreements in IP networks. It looks at the typical components of a service-level agreement, and identifies three common approaches that are used to satisfy service level agreements in IP networks. The implications of using the approaches in the context of a network service provider, a hosting service provider, and an enterprise are examined. While most providers currently offer a static insurance approach towards supporting service level agreements, the schemes that can lead to more dynamic approaches are identified.

===== Summary =====

(HS) This paper starts off by talking about different components of a service level agreement. These components include:
1) A description of the nature of service to be provided
2) The expected performance level of the service, specifically its reliability and responsiveness
3) The time-frame for response and problem resolution
4) The process for monitoring and reporting the service level
5) The consequences for the service provider not meeting its obligations
6) Escape clauses and constraints.

Then they give three examples of Service level agreements on IP Networks:
1) Network Connectivity Services
2) Hosting Services
3) Integrated services

For each of the three, they suggest some availability, performance and reliability clauses. I think that these three notions (availability, reliability and performance) could be three parameters that the scheme we are designing should have for each contract.

After this, they discuss three different approaches to supporting SLAs:
1) Insurance Approach
2) Provisioning Approach
3) Adaptive Approach

==== Trustworthiness of New Contracts ====

[http://dx.doi.org/10.1007/978-3-642-10203-5_12 Determining the Trustworthiness of New Electronic Contracts] from Lecture Notes in Computer Science by Paul Groth, Simon Miles, Sanjay Modgil, Nir Oren, Michael Luck and Yolanda Gil, 2009.

===== Abstract =====

Expressing contractual agreements electronically potentially allows agents to automatically perform functions surrounding contract use: establishment, fulfilment, renegotiation etc. For such automation to be used for real business concerns, there needs to be a high level of trust in the agent-based system. While there has been much research on simulating trust between agents, there are areas where such trust is harder to establish. In particular, contract proposals may come from parties that an agent has had no prior interaction with and, in competitive business-to-business environments, little reputation information may be available. In human practice, trust in a proposed contract is determined in part from the content of the proposal itself, and the similarity of the content to that of prior contracts, executed to varying degrees of success. In this paper, we argue that such analysis is also appropriate in automated systems, and to provide it we need systems to record salient details of prior contract use and algorithms for assessing proposals on their content. We use provenance technology to provide the former and detail algorithms for measuring contract success and similarity for the latter, applying them to an aerospace case study.

===== Summary =====

Imagine this as a table:

*Type
**Whether this is an obligation or permission. A prohibition is modelled as an obligation not to do something, i.e. with a negative normative condition below
*Target
**The contract party obliged, prohibited or permitted by the clause.
*Activating Condition
**The circumstances under which the clause has force, parameterized by the variables specific to each instance.
*Normative Condition
**The circumstances under which the obligation is not being violated or the permission is being taken advantage of, parameterized by the variable specific to each instance. Therefore, for an obligation, the target must maintain the normative condition so as not to be in violation of the contract.
*Expiration Condition
**The circumstances under which the clause no longer has force, parameterized by the variables specific to each instance.

This paper provides a nice, straightforward definition of what a contract is, and provides the above schema for a contract clause.
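
As a rough sketch of how the clause schema above might look as a data structure (our own illustration, not code from the paper): the three conditions are modelled as predicates over a hypothetical "state" dictionary, and the names <code>Clause</code>, <code>in_force</code> and <code>violated</code> are assumptions. The example clause reuses the caching scenario from our observability discussion.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict

State = Dict[str, object]  # hypothetical snapshot of what has been observed


class ClauseType(Enum):
    OBLIGATION = "obligation"
    PERMISSION = "permission"


@dataclass
class Clause:
    clause_type: ClauseType
    target: str                           # party obliged or permitted
    activating: Callable[[State], bool]   # when the clause gains force
    normative: Callable[[State], bool]    # what must hold while in force
    expiration: Callable[[State], bool]   # when the clause loses force

    def in_force(self, state: State) -> bool:
        return self.activating(state) and not self.expiration(state)

    def violated(self, state: State) -> bool:
        # Only an obligation can be violated; a permission is merely unused.
        return (
            self.clause_type is ClauseType.OBLIGATION
            and self.in_force(state)
            and not self.normative(state)
        )


# Company A is obliged to cache data for Company B on day 1 only.
cache_clause = Clause(
    clause_type=ClauseType.OBLIGATION,
    target="Company A",
    activating=lambda s: s["day"] >= 1,
    normative=lambda s: bool(s["cached_for_b"]),
    expiration=lambda s: s["day"] > 1,
)
```

A prohibition fits the same structure as an obligation whose normative condition is negated, which matches the description of "Type" above.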

==== Web Privacy with P3P ====

http://www.oreilly.de/catalog/webprivp3p/

This book discusses P3P and how companies and web developers can comply with it.
Also see http://www.w3.org/P3P/

===== Summary =====
This is about information observation, not what we want to do.

==== Gossiping in Distributed Systems ====
[1] Paper: Gossiping in Distributed Systems, Anne-Marie Kermarrec, Maarten van Steen, ACM SIGOPS, 2007

===== Abstract =====

Gossip-based algorithms were first introduced for reliably disseminating data in large-scale distributed systems. However, their simplicity, robustness, and flexibility make them attractive for more than just pure data dissemination alone. In particular, gossiping has been applied to data aggregation, overlay maintenance, and resource allocation. Gossiping applications more or less fit the same framework, with often subtle differences in algorithmic details determining divergent emergent behaviour. This divergence is often difficult to understand, as formal methods have yet to be developed that can capture the full design space of gossiping solutions. In this paper, we present a brief introduction to the field of gossiping in distributed systems, by providing a simple framework and using that framework to describe solutions for various application domains.

===== Summary =====
Hadi

This paper discusses different applications of gossiping and different approaches to it. Among the most important applications are data dissemination and monitoring services (such as failure detection). Nearly all gossip-style algorithms follow this framework:

a) Peer selection: This is the process of selecting a list of peers, either uniformly at random, or based on some ranking criteria (e.g. proximity, need etc.) to send some data to.

b) Data Exchanged: Selecting the information to pass on to the selected peers.

c) Data Processing: Processing the data received.

"Each peer is equipped with a cache, consisting of references to other peers in the system." This cache could also store information about other peers. In our project, this could be the service every other peer provides and the reputation that the peers have.

The paper is then divided into a few categories:

1) Dissemination

2) Peer Sampling

3) Topology Construction

4) Resource Management

In each, they adapt the framework mentioned above.

1) Data Dissemination: "Traditionally, gossip-based solutions have been used for data dissemination purposes. A standard approach toward dissemination is to simply let peers forward messages to each other [1]". The framework for this section is as follows:

Peer Selection: Each peer selects a list of peers to send information to.

Data Exchanged: Some message is selected and sent.

Data Processing: Receiving peer processes the data [1]
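
The three steps above can be sketched as a push-style dissemination round. This is only an illustration under our own assumptions: every peer's cache holds all other peers, the fanout is fixed at 2, peers are chosen uniformly at random, and the names <code>gossip_round</code>, <code>caches</code> and <code>known</code> are hypothetical rather than taken from the paper.

```python
import random


def gossip_round(caches, known, fanout, rng):
    """One round: every informed peer pushes what it knows to `fanout` cached peers."""
    inbox = {peer: set() for peer in caches}
    for peer, neighbours in caches.items():
        if not known[peer]:
            continue  # nothing to forward yet
        # Peer selection: uniformly at random from the local cache.
        targets = rng.sample(neighbours, min(fanout, len(neighbours)))
        # Data exchanged: forward every message this peer currently knows.
        for target in targets:
            inbox[target] |= known[peer]
    # Data processing: each receiver merges the incoming messages.
    for peer, received in inbox.items():
        known[peer] |= received


peers = [f"p{i}" for i in range(8)]
caches = {p: [q for q in peers if q != p] for p in peers}
known = {p: set() for p in peers}
known["p0"].add("contract 42 fulfilled")

rng = random.Random(0)
rounds = 0
while any(not known[p] for p in peers) and rounds < 100:
    gossip_round(caches, known, fanout=2, rng=rng)
    rounds += 1
```

In our project the same cache could also carry the service each peer offers and its reputation, as noted above.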

2) Peer Sampling:

Peer sampling assumes that the cost and latency of contacting each peer are the same. Realistically, however, this is not the case. In our system, we need to take cost, number of paths, etc. into account, since not all peers are located close to a given peer.

3) Topology Construction:

Here they mention that each node/peer only maintains a partial view of the entire system for practical reasons.

4) Resource Management

In this section, they mention another use of gossiping: resource management and monitoring, such as failure detection. In this application, the messages exchanged carry status information, such as "Are you alive?" or "I am alive" messages. These messages could take the form of heartbeats.

Within resource management, gossiping could also be used for resource allocation. They give an example of "a gossip-based approach to estimate which slice of a collection a node belongs has been proposed in."

==Increasing Observability==

As we discussed on Thursday, the real question when looking at observability is whether an action can be viewed, and who can view it. In the real world, you have a chance of being observed no matter what you do; the Internet, on the other hand, reduces this observability and instead offers a modicum of anonymity.

As the possibility of being observed increases, behavior adjusts to encourage the positive reputation of the actor or to conform with laws and regulations. This is the main benefit we wish to obtain by increasing the observability of digital actions. While omnipresent observation is possible on a computer network, in terms of observing contracts it might be more efficient to impose the possibility of being observed.

===A Possible System for Increasing Observability of Contracts and Actions?===

In class on Thursday, Scott brought up the idea of tracking a contract by making a minimal set of details available to all (i.e., everyone knows the parties involved in the contract, and whether the contract was fulfilled). Taking this a little further, our group considered the existence of an anonymous, distributed quorum of observers.

This quorum would, upon the creation of a contract, be given a summary of the contract (for example, Company A has agreed to cache data for Company B on a given day, while Company B will reciprocate the following day). Over the term of the contract, the individual systems in the quorum would test the contract to see if the terms had been met. At the end of the contract period, the systems would provide a "vote" declaring whether they witnessed the contract being fulfilled.
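
A minimal sketch of how the end-of-contract vote could be tallied. The decision rule is an assumption on our part (a simple majority of the votes cast, and only if more than half of the quorum reported at all); the function name <code>tally</code> and the observer identifiers are hypothetical.

```python
from collections import Counter


def tally(votes, quorum_size):
    """Return 'fulfilled', 'violated' or 'undecided' from the observers' votes.

    `votes` maps an observer id to True (saw the terms met) or False.
    """
    if len(votes) < quorum_size // 2 + 1:
        return "undecided"  # too few observers reported to decide
    counts = Counter(votes.values())
    yes, no = counts[True], counts[False]
    if yes > no:
        return "fulfilled"
    if no > yes:
        return "violated"
    return "undecided"


result = tally({"obs1": True, "obs2": True, "obs3": False}, quorum_size=5)
```

The outcome, rather than the individual votes, is what would need to be published to everyone, which keeps the observers anonymous.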

This system could also be extended to monitor general actions. Consider again this set of observers; now, however, they connect at random to various websites and take a snapshot of all connections to each one. At any given time, no other user knows which system the observers will be monitoring. In other words, the observers are analogous to police patrols, albeit with no set patrol route.

Category talk:2011-O&C

2011-04-04T17:53:44Z

Hadi sajjadpour: /* Summary */

==Papers==

===Observability===

* How do we define 'public' action? How do we monitor 'public' action without monitoring every action?
* How can you make sure your agent is acting according to your instructions?
* How can we ensure that information we receive through a third-party is legitimate?
* What '''CAN''' be observed?

==== Contract Monitoring ====

[http://dx.doi.org/10.1007/978-3-642-03668-2_29 Contract Monitoring in Agent-Based Systems: Case Study] from Lecture Notes in Computer Science by Jiří Hodík, Jiří Vokřínek and Michal Jakob, 2009

===== Abstract =====

Monitoring of fulfilment of obligations defined by electronic contracts in distributed domains is presented in this paper. A two-level model of contract-based systems and the types of observations needed for contract monitoring are introduced. The observations (inter-agent communication and agents’ actions) are collected and processed by the contract observation and analysis pipeline. The presented approach has been utilized in a multi-agent system for electronic contracting in a modular certification testing domain.

===== Summary =====
Andrew

==== Monitoring Service Contracts ====

[http://dx.doi.org/10.1007/3-540-45705-4_15 An Agent-Based Framework for Monitoring Service Contracts] from Lecture Notes in Computer Science by Helmut Kneer, Henrik Stormer, Harald Häuschen and Burkhard Stiller, 2002

===== Abstract =====

Within the past few years, the variety of real-time multimedia streaming services on the Internet has grown steadily. Performance of streaming services is very sensitive to traffic congestion and results very often in poor service quality on today’s best effort Internet. Reasons include the lack of any traffic prioritization mechanisms on the network level and its dependence on the cooperation of several Internet Service Providers and their reliable transmission of data packets. Therefore, service differentiation and its reliable delivery must be enforced on a business level through the introduction of service contracts between service providers and their customers. However, compliance with such service contracts is the crucial point that decides about successful improvement of the service delivery process. For that reason, an agent-based monitoring framework has been developed and introduced enabling the use of mobile agents to monitor compliance with contractual agreements between service providers and service customers. This framework describes the setup and the functionality of different kinds of mobile agents that allow monitoring of service contracts across domains of multiple service providers.

===== Summary =====
Andrew

===Contracts===

* What can or can't be contracted?
* How can you quantify abstract resources?
* How can two or more parties agree with a minimum of intervention?

Some forms of contracts exist in the form of Service Level Agreements, and there have been efforts made to automate this process:

==== AURIC ====
[http://dx.doi.org/10.1007/978-3-540-75694-1_21 AURIC: A Scalable and Highly Reusable SLA Compliance Auditing Framework] from Lecture Notes in Computer Science, by Hasan and Burkhard Stiller, 2007.

===== Abstract =====
Service Level Agreements (SLA) are needed to allow business interactions to rely on Internet services. Service Level Objectives (SLO) specify the committed performance level of a service. Thus, SLA compliance auditing aims at verifying these commitments. Since SLOs for various application services and end-to-end performance definitions vary largely, automated auditing of SLA compliances poses the challenge to an auditing framework. Moreover, end-to-end performance data are potentially large for a provider with many customers. Therefore, this paper presents a scalable and highly reusable auditing framework and a prototype, termed AURIC (Auditing Framework for Internet Services), whose components can be distributed across different domains.

===== Summary =====
TJ

==== Bandwidth ====
[http://dx.doi.org/10.1007/978-3-540-30189-9_19 SLA-Driven Flexible Bandwidth Reservation Negotiation Schemes for QoS Aware IP Networks] from Lecture Notes in Computer Science by Gerard Parr and Alan Marshall, 2004.

===== Abstract =====
We present a generic Service Level Agreement (SLA)-driven service provisioning architecture, which enables dynamic and flexible bandwidth reservation schemes on a per-user or a per-application basis. Various session level SLA negotiation schemes involving bandwidth allocation, service start time and service duration parameters are introduced and analysed. The results show that these negotiation schemes can be utilised for the benefits of both end user and network provider such as getting the highest individual SLA optimisation in terms of Quality of Service (QoS) and price. A prototype based on an industrial agent platform has also been built to demonstrate the negotiation scenario and this is presented and discussed.

===== Summary =====
Claimed by Scott

==== Dynamic Adaptation ====
[http://dx.doi.org/10.1007/978-3-540-89652-4_28 Context-Driven Autonomic Adaptation of SLA] from Lecture Notes in Computer Science by Caroline Herssens, Stéphane Faulkner and Ivan Jureta, 2008.

===== Abstract =====
Service Level Agreements (SLAs) are used in Service-Oriented Computing to define the obligations of the parties involved in a transaction. SLAs define the service users’ Quality of Service (QoS) requirements that the service provider should satisfy. Requirements defined once may not be satisfiable when the context of the web services changes (e.g., when requirements or resource availability changes). Changes in the context can make SLAs obsolete, making SLA revision necessary. We propose a method to autonomously monitor the services’ context, and adapt SLAs to avoid obsolescence thereof.

===== Summary =====
TJ

==== Heuristics for Enforcing Service Level Agreements ====
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.8674&rep=rep1&type=pdf Heuristics for Enforcing Service Level Agreements in a Public Computing Utility] A master's thesis by Balasubramaneyam Maniymaran.

===== Abstract =====
With the increasing popularity of consumer and research oriented wide-area applications, there arises a need for a robust and efficient wide-area resource management system. Even though there exists number of systems for wide area resource management, they fail to couple the QoS management with cost management, which is the key issue in pushing such a system to be commercially successful. Further, the lack of IT skills within the companies arouses the need of decoupling service management from the underlying complex wide-area resource management. A public computing utility (PCU) addresses both these issues, and, in addition, it creates a market place for the selling idling computing resources.

This work proposes a PCU model addressing the above mentioned issues and develops heuristics to enforce QoS in that model. A new concept called virtual clusters (VCs) is introduced as semi-dynamic, service specific resource partitions of a PCU, optimizing cost, QoS, and resource utilization. This thesis describes the methodology of VC creation, analyses the formulation of a VC creation into an optimization problem, and develops solution heuristics. The concept of VC is supported by two other concepts introduced here namely anchor point (AP) and overload partition (OLP). The concept of AP is used to represent the demand distribution in a network that assists the problem formulation of the VC creation and SLA management. The concept of overload partition is used to handle the demand spikes in a VC.

In a PCU, the VC management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the APs among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

===== Summary =====
One key concept that we should take from this paper is how it decides to allocate resources. The following point from the abstract is worth considering:
*In a public computing utility (PCU), the virtual cluster (VC) management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the anchor points (AP) among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

'''''KEY CONCEPTS'''''

''The key features of the PCU Model are:''
*an ISP like service structure
*proposing the resource profiling scheme for resource registration
*addressing scalability by developing PCU structure made up of domains
*incorporating peering technology for inter-domain information dissemination
*SLA based service instantiation and monitoring

''The key concepts of the VC idea in this paper are:''
*it mathematically formulates the trade-off between achieving the best QoS and reducing the system cost, making it well suited to commercial infrastructures
*even though multiple services can occupy a single resource and the service–resource attachments can change with time, a virtualized static logical resource set exposed to the service origin (SO) hides the complexity
*being a semi-dynamic scheme, a VC can reshape itself to match the varying demand pattern, while the static virtualization presented to the SO simplifies service management
*the optimization-based VC creation results in better resource utilization

''The key concept of anchor points:''
*by providing a representation of demand distribution in a network, the concept of anchor point enables client-centric resource allocation for wide-area services

''The key attributes of overload partitions:''
*they are selected via an optimization process and are shared among multiple services
*they provide a cost-effective, but still QoS-obeying, solution to handle demand spikes in the network

==== Service Level Agreement in Cloud Computing ====
[http://knoesis.wright.edu/library/download/OOPSLA_cloud_wsla_v3.pdf SLAs in Cloud Computing] A paper written by Pankesh Patel, Ajith Ranabahu, Amit Sheth.

===== Abstract =====
Cloud computing that provides cheap and pay-as-you-go computing resources is rapidly gaining momentum as an alternative to traditional IT Infrastructure. As more and more consumers delegate their tasks to cloud providers, Service Level Agreements (SLA) between consumers and providers emerge as a key aspect. Due to the dynamic nature of the cloud, continuous monitoring on Quality of Service (QoS) attributes is necessary to enforce SLAs. Also numerous other factors such as trust (on the cloud provider) come into consideration, particularly for enterprise customers that may outsource its critical data. This complex nature of the cloud landscape warrants a sophisticated means of managing SLAs. This paper proposes a mechanism for managing SLAs in a cloud computing environment using the Web Service Level Agreement (WSLA) framework, developed for SLA monitoring and SLA enforcement in a Service Oriented Architecture (SOA). We use the third party support feature of WSLA to delegate monitoring and enforcement tasks to other entities in order to solve the trust issues. We also present a real world use case to validate our proposal.

===== Summary =====
Claimed by Scott

==== Service Level Agreements on IP Networks ====

By Dinesh C. Verma, IBM T. J. Watson Research Center
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1323286&tag=1

===== Abstract =====
This paper provides an overview of service-level agreements in IP networks. It looks at the typical components of a service-level agreement, and identifies three common approaches that are used to satisfy service level agreements in IP networks. The implications of using the approaches in the context of a network service provider, a hosting service provider, and an enterprise are examined. While most providers currently offer a static insurance approach towards supporting service level agreements, the schemes that can lead to more dynamic approaches are identified.

===== Summary =====

(HS) This paper starts off by talking about different components of a service level agreement. These components include:
1) A description of the nature of service to be provided
2) The expected performance level of the service, specifically its reliability and responsiveness
3) The time-frame for response and problem resolution
4) The process for monitoring and reporting the service level
5) The consequences for the service provider not meeting its obligations
6) Escape clauses and constraints.

Then they give three examples of Service level agreements on IP Networks:
1) Network Connectivity Services
2) Hosting Services
3) Integrated services

For each of the three, they suggest some availability, performance and reliability clauses. I think that these three notions (availability, reliability and performance) could be three parameters that the scheme we are designing should have for each contract.
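
A small sketch of what carrying these three parameters on every contract could look like. The field names, the units (fractions between 0 and 1, milliseconds) and the <code>meets</code> check are our own assumptions; the paper does not prescribe any of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevels:
    availability: float    # fraction of time the service is reachable, 0..1
    reliability: float     # fraction of requests completed without error, 0..1
    performance_ms: float  # response time in milliseconds (lower is better)


def meets(agreed: ServiceLevels, measured: ServiceLevels) -> bool:
    """True if the measured levels satisfy every agreed level."""
    return (
        measured.availability >= agreed.availability
        and measured.reliability >= agreed.reliability
        and measured.performance_ms <= agreed.performance_ms
    )


agreed = ServiceLevels(availability=0.99, reliability=0.95, performance_ms=200.0)
good = ServiceLevels(availability=0.995, reliability=0.97, performance_ms=150.0)
slow = ServiceLevels(availability=0.995, reliability=0.97, performance_ms=350.0)
```

Here <code>meets(agreed, good)</code> holds, while <code>meets(agreed, slow)</code> fails on the performance clause alone, which is the kind of per-parameter outcome an observer could report.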

After this, they discuss three different approaches to supporting SLAs:
1) Insurance Approach
2) Provisioning Approach
3) Adaptive Approach

==== Trustworthiness of New Contracts ====

[http://dx.doi.org/10.1007/978-3-642-10203-5_12 Determining the Trustworthiness of New Electronic Contracts] from Lecture Notes in Computer Science by Paul Groth, Simon Miles, Sanjay Modgil, Nir Oren, Michael Luck and Yolanda Gil, 2009.

===== Abstract =====

Expressing contractual agreements electronically potentially allows agents to automatically perform functions surrounding contract use: establishment, fulfilment, renegotiation etc. For such automation to be used for real business concerns, there needs to be a high level of trust in the agent-based system. While there has been much research on simulating trust between agents, there are areas where such trust is harder to establish. In particular, contract proposals may come from parties that an agent has had no prior interaction with and, in competitive business-to-business environments, little reputation information may be available. In human practice, trust in a proposed contract is determined in part from the content of the proposal itself, and the similarity of the content to that of prior contracts, executed to varying degrees of success. In this paper, we argue that such analysis is also appropriate in automated systems, and to provide it we need systems to record salient details of prior contract use and algorithms for assessing proposals on their content. We use provenance technology to provide the former and detail algorithms for measuring contract success and similarity for the latter, applying them to an aerospace case study.

===== Summary =====

Imagine this as a table:

*Type
**Whether this is an obligation or permission. A prohibition is modelled as an obligation not to do something, i.e. with a negative normative condition below
*Target
**The contract party obliged, prohibited or permitted by the clause.
*Activating Condition
**The circumstances under which the clause has force, parameterized by the variables specific to each instance.
*Normative Condition
**The circumstances under which the obligation is not being violated or the permission is being taken advantage of, parameterized by the variable specific to each instance. Therefore, for an obligation, the target must maintain the normative condition so as not to be in violation of the contract.
*Expiration Condition
**The circumstances under which the clause no longer has force, parameterized by the variables specific to each instance.

This paper provides a nice, straightforward definition of what a contract is, and provides the above schema for a contract clause.

==== Web Privacy with P3P ====

http://www.oreilly.de/catalog/webprivp3p/

This book discusses P3P and how companies and web developers can comply with it.
Also see http://www.w3.org/P3P/

===== Summary =====
This is about information observation, not what we want to do.

==== Gossiping in Distributed Systems ====
[1] Paper: Gossiping in Distributed Systems, Anne-Marie Kermarrec, Maarten van Steen, ACM SIGOPS, 2007

===== Abstract =====

Gossip-based algorithms were first introduced for reliably disseminating data in large-scale distributed systems. However, their simplicity, robustness, and flexibility make them attractive for more than just pure data dissemination alone. In particular, gossiping has been applied to data aggregation, overlay maintenance, and resource allocation. Gossiping applications more or less fit the same framework, with often subtle differences in algorithmic details determining divergent emergent behaviour. This divergence is often difficult to understand, as formal methods have yet to be developed that can capture the full design space of gossiping solutions. In this paper, we present a brief introduction to the field of gossiping in distributed systems, by providing a simple framework and using that framework to describe solutions for various application domains.

===== Summary =====
Hadi

This paper discusses different applications of gossiping and different approaches to it. Among the most important applications are data dissemination and monitoring services (such as failure detection). Nearly all gossip-style algorithms follow this framework:

a) Peer selection: Selecting a list of peers to send data to, either uniformly at random or based on some ranking criterion (e.g. proximity or need).

b) Data exchanged: Selecting the information to pass on to the selected peers.

c) Data processing: Processing the data received.
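
The three steps above can be sketched as one push-style gossip round. This is only a minimal illustration, not the algorithm of any particular system in the paper: the class <code>Peer</code>, the functions <code>gossip_round</code> and <code>build_network</code>, and the choice to send every known message are all assumptions made for the example. Each peer holds a cache of references to other peers.

```python
import random


class Peer:
    """A peer with a cache of known peers and a set of received messages."""

    def __init__(self, name):
        self.name = name
        self.cache = []        # references to other peers
        self.messages = set()  # data received so far

    def select_peers(self, fanout, rng):
        # a) Peer selection: uniformly at random from the cache.
        return rng.sample(self.cache, min(fanout, len(self.cache)))

    def select_data(self):
        # b) Data exchanged: in this sketch, everything the peer knows.
        return set(self.messages)

    def process(self, data):
        # c) Data processing: merge the received data into local state.
        self.messages |= data


def gossip_round(peers, fanout, rng):
    """Run one round in which every peer pushes to `fanout` cached peers."""
    for peer in peers:
        payload = peer.select_data()
        for target in peer.select_peers(fanout, rng):
            target.process(payload)


def build_network(n):
    """Create n peers whose caches list every other peer (hypothetical setup)."""
    peers = [Peer(f"p{i}") for i in range(n)]
    for p in peers:
        p.cache = [q for q in peers if q is not p]
    return peers
```

A ranking-based selection (proximity, need, or reputation in our project) would only change <code>select_peers</code>; the other two steps stay the same.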

"Each peer is equipped with a cache, consisting of references to other peers in the system." This cache could also store information about other peers. In our project, this could be the service every other peer provides and the reputation that the peers have.

The paper is then divided into a few categories:

1) Dissemination

2) Peer Sampling

3) Topology Construction

4) Resource Management

For each category, they adapt the framework described above.

1) Data Dissemination: "Traditionally, gossip-based solutions have been used for data dissemination purposes. A standard approach toward dissemination is to simply let peers forward messages to each other" [1]. The framework for this section is as follows:

Peer Selection: Each peer selects a list of peers to send information to.

Data Exchanged: Some message is selected and sent.

Data Processing: The receiving peer processes the data [1].

2) Peer Sampling:

Peer sampling assumes that the cost and latency of contacting each peer are the same. Realistically, this is not the case. In our system, we need to take into account cost, number of paths, etc., since not all of a peer's known peers are located close to it.

3) Topology Construction:

Here they mention that each node/peer only maintains a partial view of the entire system for practical reasons.

4) Resource Management

In this section, they mention the other use of gossiping: resource management and monitoring, such as failure detection. In this application, the messages exchanged carry status information, such as "Are you alive?" or "I am alive" messages. These messages could take the form of heartbeats.
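
As a rough sketch of how heartbeat messages could drive failure detection, assume each peer keeps a table of heartbeat counters and merges it with the tables it receives through gossip. The function names, the counter representation and the timeout rule below are assumptions for illustration, not details taken from the paper.

```python
def merge_heartbeats(local, remote):
    """Merge two heartbeat tables, keeping the highest counter per peer."""
    merged = dict(local)
    for peer, counter in remote.items():
        if counter > merged.get(peer, -1):
            merged[peer] = counter
    return merged


def suspected_failures(last_increase, now, timeout):
    """Return peers whose counter has not increased within `timeout`.

    `last_increase` maps each peer to the local time at which its
    heartbeat counter was last seen to grow.
    """
    return sorted(p for p, t in last_increase.items() if now - t > timeout)
```

A peer that stops sending "I am alive" messages stops increasing its counter, so every other peer eventually lists it as suspected.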

Gossiping could also be used for resource allocation. They give an example of "a gossip-based approach to estimate which slice of a collection a node belongs has been proposed in."

==Increasing Observability==

As we discussed on Thursday, the real question when looking at observability is whether an action can be viewed, and who can view it. In the real world, you have a chance of being observed no matter what you do; the Internet, on the other hand, reduces this observability and instead offers a modicum of anonymity.

As the possibility of being observed increases, behavior adjusts to encourage the positive reputation of the actor or to conform with laws and regulations. This is the main benefit we wish to obtain by increasing the observability of digital actions. While omnipresent observation is possible on a computer network, in terms of observing contracts it might be more efficient to impose the possibility of being observed.

===A Possible System for Increasing Observability of Contracts and Actions?===

In class on Thursday, Scott brought up the idea of tracking a contract by making a minimal set of details available to all (i.e., everyone knows the parties involved in the contract, and whether the contract was fulfilled). Taking this a little further, our group considered the existence of an anonymous, distributed quorum of observers.

This quorum would, upon the creation of a contract, be given a summary of the contract (for example, Company A has agreed to cache data for Company B on a given day, while Company B will reciprocate the following day). Over the term of the contract, the individual systems in the quorum would test the contract to see if the terms had been met. At the end of the contract period, the systems would provide a "vote" declaring whether they witnessed the contract being fulfilled.
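
A minimal sketch of how the quorum's votes might be tallied is shown below. The simple-majority rule, the treatment of observers that witnessed nothing, and the name <code>tally_votes</code> are all assumptions for illustration; we have not settled on a voting rule.

```python
def tally_votes(votes):
    """Decide a contract's outcome from observer votes.

    `votes` maps an observer id to True (witnessed fulfilment),
    False (witnessed a violation) or None (witnessed nothing).
    """
    cast = [v for v in votes.values() if v is not None]
    yes = sum(1 for v in cast if v)
    no = len(cast) - yes
    if yes > no:
        return "fulfilled"
    if no > yes:
        return "violated"
    return "undecided"
```

For example, <code>tally_votes({"o1": True, "o2": True, "o3": False})</code> reports the contract as fulfilled, while a tie or an empty set of witnesses is left undecided.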

This system could also be extended to monitor general actions. Consider again this set of observers, but now they connect at random to various websites and take a snapshot of all connections to each one. At any given time, no other user knows which system the observers will be monitoring. In other words, the observers are analogous to police patrols, albeit with no set patrol route.

Category talk:2011-O&C

2011-04-04T17:53:17Z

Hadi sajjadpour:

==Papers==

===Observability===

* How do we define 'public' action? How do we monitor 'public' action without monitoring every action?
* How can you make sure your agent is acting according to your instructions?
* How can we ensure that information we receive through a third-party is legitimate?
* What '''CAN''' be observed?

==== Contract Monitoring ====

[http://dx.doi.org/10.1007/978-3-642-03668-2_29 Contract Monitoring in Agent-Based Systems: Case Study] from Lecture Notes in Computer Science by Jiří Hodík, Jiří Vokřínek and Michal Jakob, 2009

===== Abstract =====

Monitoring of fulfilment of obligations defined by electronic contracts in distributed domains is presented in this paper. A two-level model of contract-based systems and the types of observations needed for contract monitoring are introduced. The observations (inter-agent communication and agents’ actions) are collected and processed by the contract observation and analysis pipeline. The presented approach has been utilized in a multi-agent system for electronic contracting in a modular certification testing domain.

===== Summary =====
Andrew

==== Monitoring Service Contracts ====

[http://dx.doi.org/10.1007/3-540-45705-4_15 An Agent-Based Framework for Monitoring Service Contracts] from Lecture Notes in Computer Science by Helmut Kneer, Henrik Stormer, Harald Häuschen and Burkhard Stiller, 2002

===== Abstract =====

Within the past few years, the variety of real-time multimedia streaming services on the Internet has grown steadily. Performance of streaming services is very sensitive to traffic congestion and results very often in poor service quality on today’s best effort Internet. Reasons include the lack of any traffic prioritization mechanisms on the network level and its dependence on the cooperation of several Internet Service Providers and their reliable transmission of data packets. Therefore, service differentiation and its reliable delivery must be enforced on a business level through the introduction of service contracts between service providers and their customers. However, compliance with such service contracts is the crucial point that decides about successful improvement of the service delivery process. For that reason, an agent-based monitoring framework has been developed and introduced enabling the use of mobile agents to monitor compliance with contractual agreements between service providers and service customers. This framework describes the setup and the functionality of different kinds of mobile agents that allow monitoring of service contracts across domains of multiple service providers.

===== Summary =====
Andrew

===Contracts===

* What can or can't be contracted?
* How can you quantify abstract resources?
* How can two or more parties agree with a minimum of intervention?

Some forms of contracts exist in the form of Service Level Agreements, and there have been efforts made to automate this process:

==== AURIC ====
[http://dx.doi.org/10.1007/978-3-540-75694-1_21 AURIC: A Scalable and Highly Reusable SLA Compliance Auditing Framework] from Lecture Notes in Computer Science, by Hasan and Burkhard Stiller, 2007.

===== Abstract =====
Service Level Agreements (SLA) are needed to allow business interactions to rely on Internet services. Service Level Objectives (SLO) specify the committed performance level of a service. Thus, SLA compliance auditing aims at verifying these commitments. Since SLOs for various application services and end-to-end performance definitions vary largely, automated auditing of SLA compliances poses the challenge to an auditing framework. Moreover, end-to-end performance data are potentially large for a provider with many customers. Therefore, this paper presents a scalable and highly reusable auditing framework and a prototype, termed AURIC (Auditing Framework for Internet Services), whose components can be distributed across different domains.

===== Summary =====
TJ

==== Bandwidth ====
[http://dx.doi.org/10.1007/978-3-540-30189-9_19 SLA-Driven Flexible Bandwidth Reservation Negotiation Schemes for QoS Aware IP Networks] from Lecture Notes in Computer Science by Gerard Parr and Alan Marshall, 2004.

===== Abstract =====
We present a generic Service Level Agreement (SLA)-driven service provisioning architecture, which enables dynamic and flexible bandwidth reservation schemes on a per-user or a per-application basis. Various session level SLA negotiation schemes involving bandwidth allocation, service start time and service duration parameters are introduced and analysed. The results show that these negotiation schemes can be utilised for the benefits of both end user and network provider such as getting the highest individual SLA optimisation in terms of Quality of Service (QoS) and price. A prototype based on an industrial agent platform has also been built to demonstrate the negotiation scenario and this is presented and discussed.

===== Summary =====
Claimed by Scott

==== Dynamic Adaptation ====
[http://dx.doi.org/10.1007/978-3-540-89652-4_28 Context-Driven Autonomic Adaptation of SLA] from Lecture Notes in Computer Science by Caroline Herssens, Stéphane Faulkner and Ivan Jureta, 2008.

===== Abstract =====
Service Level Agreements (SLAs) are used in Service-Oriented Computing to define the obligations of the parties involved in a transaction. SLAs define the service users’ Quality of Service (QoS) requirements that the service provider should satisfy. Requirements defined once may not be satisfiable when the context of the web services changes (e.g., when requirements or resource availability changes). Changes in the context can make SLAs obsolete, making SLA revision necessary. We propose a method to autonomously monitor the services’ context, and adapt SLAs to avoid obsolescence thereof.

===== Summary =====
TJ

==== Heuristics for Enforcing Service Level Agreements ====
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.8674&rep=rep1&type=pdf Heuristics for Enforcing Service Level Agreements in a Public Computing Utility] A master's thesis by Balasubramaneyam Maniymaran.

===== Abstract =====
With the increasing popularity of consumer and research oriented wide-area applications, there arises a need for a robust and efficient wide-area resource management system. Even though there exists number of systems for wide area resource management, they fail to couple the QoS management with cost management, which is the key issue in pushing such a system to be commercially successful. Further, the lack of IT skills within the companies arouses the need of decoupling service management from the underlying complex wide-area resource management. A public computing utility (PCU) addresses both these issues, and, in addition, it creates a market place for the selling idling computing resources.

This work proposes a PCU model addressing the above mentioned issues and develops heuristics to enforce QoS in that model. A new concept called virtual clusters (VCs) is introduced as semi-dynamic, service specific resource partitions of a PCU, optimizing cost, QoS, and resource utilization. This thesis describes the methodology of VC creation, analyses the formulation of a VC creation into an optimization problem, and develops solution heuristics. The concept of VC is supported by two other concepts introduced here namely anchor point (AP) and overload partition (OLP). The concept of AP is used to represent the demand distribution in a network that assists the problem formulation of the VC creation and SLA management. The concept of overload partition is used to handle the demand spikes in a VC.

In a PCU, the VC management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the APs among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

===== Summary =====
One key concept that we should take from this paper is the way they decided how to allocate the resources. Here is a brief but excellent point to consider:
*In a public computing utility (PCU), the virtual cluster (VC) management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the anchor points (AP) among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

'''''KEY CONCEPTS'''''

''The key features of the PCU Model are:''
*an ISP like service structure
*proposing the resource profiling scheme for resource registration
*addressing scalability by developing PCU structure made up of domains
*incorporating peering technology for inter-domain information dissemination
*SLA based service instantiation and monitoring

''The key concepts of the VCs idea in this paper are:''
*it mathematically formulates the trade-off between achieving the best QoS and reducing the system cost, making it best suitable for commercial infrastructures
*even though multiple services can occupy a single resource and the service–resource attachments can change with time, a virtualized static logical resource set exposed to the service origin (SO) hides the complexity
*being a semi-dynamic scheme, a VC can reshape itself matching the varying demand pattern, at the same time the static virtualization to the SO simplifying the service management
*the optimization based VC creation results in better resource utilization

''The key concept of anchor points:''
*By providing a representation of demand distribution in a network, the concept of anchor point enables a client-centric resource allocation for wide-area services.

''The key attributes of Overload Partitions:''
*they are selected via an optimization process and they are shared among multiple services.
*they provide a cost-effective, but still QoS-obeying, solution to handle demand spikes in the network

==== Service Level Agreement in Cloud Computing ====
[http://knoesis.wright.edu/library/download/OOPSLA_cloud_wsla_v3.pdf SLAs in Cloud Computing] A paper written by Pankesh Patel, Ajith Ranabahu, Amit Sheth.

===== Abstract =====
Cloud computing that provides cheap and pay-as-you-go computing resources is rapidly gaining momentum as an alternative to traditional IT Infrastructure. As more and more consumers delegate their tasks to cloud providers, Service Level Agreements (SLA) between consumers and providers emerge as a key aspect. Due to the dynamic nature of the cloud, continuous monitoring on Quality of Service (QoS) attributes is necessary to enforce SLAs. Also numerous other factors such as trust (on the cloud provider) come into consideration, particularly for enterprise customers that may outsource its critical data. This complex nature of the cloud landscape warrants a sophisticated means of managing SLAs. This paper proposes a mechanism for managing SLAs in a cloud computing environment using the Web Service Level Agreement (WSLA) framework, developed for SLA monitoring and SLA enforcement in a Service Oriented Architecture (SOA). We use the third party support feature of WSLA to delegate monitoring and enforcement tasks to other entities in order to solve the trust issues. We also present a real world use case to validate our proposal.

===== Summary =====
Claimed by Scott

==== Service Level Agreements on IP Networks ====

By Dinesh C. Verma, IBM T. J. Watson Research Center
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1323286&tag=1

===== Abstract =====
This paper provides an overview of service-level agreements in IP networks. It looks at the typical components of a service-level agreement, and identifies three common approaches that are used to satisfy service level agreements in IP networks. The implications of using the approaches in the context of a network service provider, a hosting service provider, and an enterprise are examined. While most providers currently offer a static insurance approach towards supporting service level agreements, the schemes that can lead to more dynamic approaches are identified.

===== Summary =====

(HS) This paper starts off by describing the different components of a service level agreement. These components include:

1) A description of the nature of service to be provided

2) The expected performance level of the service, specifically its reliability and responsiveness

3) The time-frame for response and problem resolution

4) The process for monitoring and reporting the service level

5) The consequences for the service provider not meeting its obligations

6) Escape clauses and constraints.

Then they give three examples of service level agreements on IP networks:

1) Network Connectivity Services

2) Hosting Services

3) Integrated Services

For each of the above three, they suggest some availability, performance and reliability clauses. I think the three notions of availability, reliability and performance could be three parameters that each contract in the scheme we are designing should have.
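
To make the idea concrete, here is a small sketch of a contract carrying these three parameters and a check of measured values against promised ones. The field names, the units and the way each notion is quantified are assumptions for illustration only; the paper does not prescribe them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    availability: float    # fraction of time the service is reachable, 0..1
    reliability: float     # fraction of requests completed successfully, 0..1
    max_latency_ms: float  # performance bound: worst acceptable response time


def meets(promised, measured):
    """Return True if the measured levels satisfy every promised level."""
    return (
        measured.availability >= promised.availability
        and measured.reliability >= promised.reliability
        and measured.max_latency_ms <= promised.max_latency_ms
    )
```

An observer could call <code>meets</code> at the end of a contract period to decide how to vote on whether the provider fulfilled its side.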

After this, they discuss three different approaches to supporting SLAs:

1) Insurance Approach

2) Provisioning Approach

3) Adaptive Approach

==== Trustworthiness of New Contracts ====

[http://dx.doi.org/10.1007/978-3-642-10203-5_12 Determining the Trustworthiness of New Electronic Contracts] from Lecture Notes in Computer Science by Paul Groth, Simon Miles, Sanjay Modgil, Nir Oren, Michael Luck and Yolanda Gil, 2009.

===== Abstract =====

Expressing contractual agreements electronically potentially allows agents to automatically perform functions surrounding contract use: establishment, fulfilment, renegotiation etc. For such automation to be used for real business concerns, there needs to be a high level of trust in the agent-based system. While there has been much research on simulating trust between agents, there are areas where such trust is harder to establish. In particular, contract proposals may come from parties that an agent has had no prior interaction with and, in competitive business-to-business environments, little reputation information may be available. In human practice, trust in a proposed contract is determined in part from the content of the proposal itself, and the similarity of the content to that of prior contracts, executed to varying degrees of success. In this paper, we argue that such analysis is also appropriate in automated systems, and to provide it we need systems to record salient details of prior contract use and algorithms for assessing proposals on their content. We use provenance technology to provide the former and detail algorithms for measuring contract success and similarity for the latter, applying them to an aerospace case study.

===== Summary =====

Imagine this as a table:

*Type
**Whether this is an obligation or permission. A prohibition is modelled as an obligation not to do something, i.e. with a negative normative condition below
*Target
**The contract party obliged, prohibited or permitted by the clause.
*Activating Condition
**The circumstances under which the clause has force, parameterized by the variables specific to each instance.
*Normative Condition
**The circumstances under which the obligation is not being violated or the permission is being taken advantage of, parameterized by the variables specific to each instance. Therefore, for an obligation, the target must maintain the normative condition so as not to be in violation of the contract.
*Expiration Condition
**The circumstances under which the clause no longer has force, parameterized by the variables specific to each instance.

This paper provides a clear, straightforward definition of what a contract is, along with the above schema for a contract.
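
The clause schema above could be represented roughly as follows. This is a simplified sketch under our own assumptions: conditions are modelled as plain predicates over a state dictionary, a clause is treated as in force whenever its activating condition holds and its expiration condition does not, and the names <code>Clause</code>, <code>in_force</code> and <code>violated</code> are ours rather than the paper's.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]
Condition = Callable[[State], bool]


@dataclass
class Clause:
    kind: str              # "obligation" or "permission"
    target: str            # the contract party bound by the clause
    activating: Condition  # when the clause has force
    normative: Condition   # what must hold for an obligation to be respected
    expiration: Condition  # when the clause no longer has force

    def in_force(self, state: State) -> bool:
        return self.activating(state) and not self.expiration(state)

    def violated(self, state: State) -> bool:
        # Only obligations can be violated; a prohibition is an
        # obligation with a negated normative condition.
        return (
            self.kind == "obligation"
            and self.in_force(state)
            and not self.normative(state)
        )


# Hypothetical example: Company A must cache data for Company B on day 1.
cache_clause = Clause(
    kind="obligation",
    target="Company A",
    activating=lambda s: s["day"] >= 1,
    normative=lambda s: s["a_caching_for_b"],
    expiration=lambda s: s["day"] > 1,
)
```

Here <code>cache_clause.violated({"day": 1, "a_caching_for_b": False})</code> is true, while the same state on day 2 is not a violation because the clause has expired.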

==== Web Privacy with P3P ====

http://www.oreilly.de/catalog/webprivp3p/

This book covers P3P and how companies and web developers can comply with it.
See also http://www.w3.org/P3P/

===== Summary =====
This is about information observation, not what we want to do.

==== Gossiping in Distributed Systems ====
[1] Paper: Gossiping in Distributed Systems, Anne-Marie Kermarrec, Maarten van Steen, ACM Sigops 2007

===== Abstract =====

Gossip-based algorithms were first introduced for reliably disseminating data in large-scale distributed systems. However, their simplicity, robustness, and flexibility make them attractive for more than just pure data dissemination alone. In particular, gossiping has been applied to data aggregation, overlay maintenance, and resource allocation. Gossiping applications more or less fit the same framework, with often subtle differences in algorithmic details determining divergent emergent behaviour. This divergence is often difficult to understand, as formal methods have yet to be developed that can capture the full design space of gossiping solutions. In this paper, we present a brief introduction to the field of gossiping in distributed systems, by providing a simple framework and using that framework to describe solutions for various application domains.

===== Summary =====
Hadi

This paper discusses different applications of gossiping and different approaches to it. Among the most important applications are data dissemination and monitoring services (such as failure detection). Nearly all gossip-style algorithms follow this framework:

a) Peer selection: Selecting a list of peers to send data to, either uniformly at random or based on some ranking criterion (e.g. proximity or need).

b) Data exchanged: Selecting the information to pass on to the selected peers.

c) Data processing: Processing the data received.

"Each peer is equipped with a cache, consisting of references to other peers in the system." This cache could also store information about other peers. In our project, this could be the service every other peer provides and the reputation that the peers have.

The paper is then divided into a few categories:

1) Dissemination

2) Peer Sampling

3) Topology Construction

4) Resource Management

For each category, they adapt the framework described above.

1) Data Dissemination: "Traditionally, gossip-based solutions have been used for data dissemination purposes. A standard approach toward dissemination is to simply let peers forward messages to each other" [1]. The framework for this section is as follows:

Peer Selection: Each peer selects a list of peers to send information to.

Data Exchanged: Some message is selected and sent.

Data Processing: The receiving peer processes the data [1].

2) Peer Sampling:

Peer sampling assumes that the cost and latency of contacting each peer are the same. Realistically, this is not the case. In our system, we need to take into account cost, number of paths, etc., since not all of a peer's known peers are located close to it.
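
One way our system could account for cost is to bias peer selection toward cheaper peers instead of sampling uniformly. The sketch below is an assumption of ours, not something the paper proposes: it weights each peer by the inverse of a positive cost value (latency, hop count, or similar) and draws peers without replacement.

```python
import random


def select_peers_by_cost(costs, k, rng):
    """Pick up to k distinct peers, favouring those with a low cost.

    `costs` maps a peer name to a positive cost; the selection weight
    of a peer is 1 / cost.
    """
    remaining = dict(costs)
    chosen = []
    while remaining and len(chosen) < k:
        names = list(remaining)
        weights = [1.0 / remaining[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen
```

Distant peers still have a non-zero chance of being selected, so information can keep spreading across the whole network rather than only among neighbours.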

3) Topology Construction:

Here they mention that each node/peer only maintains a partial view of the entire system for practical reasons.

4) Resource Management

In this section, they mention the other use of gossiping: resource management and monitoring, such as failure detection. In this application, the messages exchanged carry status information, such as "Are you alive?" or "I am alive" messages. These messages could take the form of heartbeats.

Gossiping could also be used for resource allocation. They give an example of "a gossip-based approach to estimate which slice of a collection a node belongs has been proposed in."

==Increasing Observability==

As we discussed on Thursday, the real question when looking at observability is whether an action can be viewed, and who can view it. In the real world, you have a chance of being observed no matter what you do; the Internet, on the other hand, reduces this observability and instead offers a modicum of anonymity.

As the possibility of being observed increases, behavior adjusts to encourage the positive reputation of the actor or to conform with laws and regulations. This is the main benefit we wish to obtain by increasing the observability of digital actions. While omnipresent observation is possible on a computer network, in terms of observing contracts it might be more efficient to impose the possibility of being observed.

===A Possible System for Increasing Observability of Contracts and Actions?===

In class on Thursday, Scott brought up the idea of tracking a contract by making a minimal set of details available to all (i.e., everyone knows the parties involved in the contract, and whether the contract was fulfilled). Taking this a little further, our group considered the existence of an anonymous, distributed quorum of observers.

This quorum would, upon the creation of a contract, be given a summary of the contract (for example, Company A has agreed to cache data for Company B on a given day, while Company B will reciprocate the following day). Over the term of the contract, the individual systems in the quorum would test the contract to see if the terms had been met. At the end of the contract period, the systems would provide a "vote" declaring whether they witnessed the contract being fulfilled.

This system could also be extended to monitor general actions. Consider again this set of observers, but now they connect at random to various websites and take a snapshot of all connections to each one. At any given time, no other user knows which system the observers will be monitoring. In other words, the observers are analogous to police patrols, albeit with no set patrol route.
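
The patrol idea can be sketched as follows, under strong simplifying assumptions of our own: every observer independently picks one site uniformly at random in each time slot, and an action at a site is seen if at least one observer is there during a slot in which it happens. The function names are hypothetical.

```python
import random


def pick_patrol_targets(observers, sites, rng):
    """Assign each observer a site for this time slot, uniformly at random."""
    return {observer: rng.choice(sites) for observer in observers}


def detection_probability(num_sites, num_observers, num_slots):
    """Chance that a given site is visited at least once.

    Each of the num_observers * num_slots independent picks misses the
    site with probability 1 - 1/num_sites.
    """
    if num_sites < 1:
        raise ValueError("num_sites must be at least 1")
    miss_one = 1.0 - 1.0 / num_sites
    return 1.0 - miss_one ** (num_observers * num_slots)
```

In a real deployment the random source would need to be unpredictable to the parties being observed (for example <code>random.SystemRandom</code> rather than a seeded generator), otherwise the patrol route could be guessed.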

Category talk:2011-O&C

2011-04-04T17:06:26Z

Hadi sajjadpour:

==Papers==

===Observability===

* How do we define 'public' action? How do we monitor 'public' action without monitoring every action?
* How can you make sure your agent is acting according to your instructions?
* How can we ensure that information we receive through a third-party is legitimate?
* What '''CAN''' be observed?

==== Contract Monitoring ====

[http://dx.doi.org/10.1007/978-3-642-03668-2_29 Contract Monitoring in Agent-Based Systems: Case Study] from Lecture Notes in Computer Science by Jiří Hodík, Jiří Vokřínek and Michal Jakob, 2009

===== Abstract =====

Monitoring of fulfilment of obligations defined by electronic contracts in distributed domains is presented in this paper. A two-level model of contract-based systems and the types of observations needed for contract monitoring are introduced. The observations (inter-agent communication and agents’ actions) are collected and processed by the contract observation and analysis pipeline. The presented approach has been utilized in a multi-agent system for electronic contracting in a modular certification testing domain.

===== Summary =====
Andrew

==== Monitoring Service Contracts ====

[http://dx.doi.org/10.1007/3-540-45705-4_15 An Agent-Based Framework for Monitoring Service Contracts] from Lecture Notes in Computer Science by Helmut Kneer, Henrik Stormer, Harald Häuschen and Burkhard Stiller, 2002

===== Abstract =====

Within the past few years, the variety of real-time multimedia streaming services on the Internet has grown steadily. Performance of streaming services is very sensitive to traffic congestion and results very often in poor service quality on today’s best effort Internet. Reasons include the lack of any traffic prioritization mechanisms on the network level and its dependence on the cooperation of several Internet Service Providers and their reliable transmission of data packets. Therefore, service differentiation and its reliable delivery must be enforced on a business level through the introduction of service contracts between service providers and their customers. However, compliance with such service contracts is the crucial point that decides about successful improvement of the service delivery process. For that reason, an agent-based monitoring framework has been developed and introduced enabling the use of mobile agents to monitor compliance with contractual agreements between service providers and service customers. This framework describes the setup and the functionality of different kinds of mobile agents that allow monitoring of service contracts across domains of multiple service providers.

===== Summary =====
Andrew

===Contracts===

* What can or can't be contracted?
* How can you quantify abstract resources?
* How can two or more parties agree with a minimum of intervention?

Some forms of contracts exist in the form of Service Level Agreements, and there have been efforts made to automate this process:

==== AURIC ====
[http://dx.doi.org/10.1007/978-3-540-75694-1_21 AURIC: A Scalable and Highly Reusable SLA Compliance Auditing Framework] from Lecture Notes in Computer Science, by Hasan and Burkhard Stiller, 2007.

===== Abstract =====
Service Level Agreements (SLA) are needed to allow business interactions to rely on Internet services. Service Level Objectives (SLO) specify the committed performance level of a service. Thus, SLA compliance auditing aims at verifying these commitments. Since SLOs for various application services and end-to-end performance definitions vary largely, automated auditing of SLA compliances poses the challenge to an auditing framework. Moreover, end-to-end performance data are potentially large for a provider with many customers. Therefore, this paper presents a scalable and highly reusable auditing framework and a prototype, termed AURIC (Auditing Framework for Internet Services), whose components can be distributed across different domains.

===== Summary =====
TJ

==== Bandwidth ====
[http://dx.doi.org/10.1007/978-3-540-30189-9_19 SLA-Driven Flexible Bandwidth Reservation Negotiation Schemes for QoS Aware IP Networks] from Lecture Notes in Computer Science by Gerard Parr and Alan Marshall, 2004.

===== Abstract =====
We present a generic Service Level Agreement (SLA)-driven service provisioning architecture, which enables dynamic and flexible bandwidth reservation schemes on a per-user or a per-application basis. Various session level SLA negotiation schemes involving bandwidth allocation, service start time and service duration parameters are introduced and analysed. The results show that these negotiation schemes can be utilised for the benefits of both end user and network provider such as getting the highest individual SLA optimisation in terms of Quality of Service (QoS) and price. A prototype based on an industrial agent platform has also been built to demonstrate the negotiation scenario and this is presented and discussed.

===== Summary =====
Claimed by Scott

==== Dynamic Adaptation ====
[http://dx.doi.org/10.1007/978-3-540-89652-4_28 Context-Driven Autonomic Adaptation of SLA] from Lecture Notes in Computer Science by Caroline Herssens, Stéphane Faulkner and Ivan Jureta, 2008.

===== Abstract =====
Service Level Agreements (SLAs) are used in Service-Oriented Computing to define the obligations of the parties involved in a transaction. SLAs define the service users’ Quality of Service (QoS) requirements that the service provider should satisfy. Requirements defined once may not be satisfiable when the context of the web services changes (e.g., when requirements or resource availability changes). Changes in the context can make SLAs obsolete, making SLA revision necessary. We propose a method to autonomously monitor the services’ context, and adapt SLAs to avoid obsolescence thereof.

===== Summary =====
TJ

==== Heuristics for Enforcing Service Level Agreements ====
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.8674&rep=rep1&type=pdf Heuristics for Enforcing Service Level Agreements in a Public Computing Utility] A masters thesis paper by Balasubramaneyam Maniymaran.

===== Abstract =====
With the increasing popularity of consumer and research oriented wide-area applications, there arises a need for a robust and efficient wide-area resource management system. Even though there exists number of systems for wide area resource management, they fail to couple the QoS management with cost management, which is the key issue in pushing such a system to be commercially successful. Further, the lack of IT skills within the companies arouses the need of decoupling service management from the underlying complex wide-area resource management. A public computing utility (PCU) addresses both these issues, and, in addition, it creates a market place for the selling idling computing resources.

This work proposes a PCU model addressing the above mentioned issues and develops heuristics to enforce QoS in that model. A new concept called virtual clusters (VCs) is introduced as semi-dynamic, service specific resource partitions of a PCU, optimizing cost, QoS, and resource utilization. This thesis describes the methodology of VC creation, analyses the formulation of a VC creation into an optimization problem, and develops solution heuristics. The concept of VC is supported by two other concepts introduced here namely anchor point (AP) and overload partition (OLP). The concept of AP is used to represent the demand distribution in a network that assists the problem formulation of the VC creation and SLA management. The concept of overload partition is used to handle the demand spikes in a VC.

In a PCU, the VC management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the APs among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

===== Summary =====
One key concept that we should take from this paper is the way they decided how to allocate the resources. Here is a brief but excellent point to consider:
*In a public computing utility (PCU), the virtual cluster (VC) management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the anchor points (AP) among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

'''''KEY CONCEPTS'''''

''The key features of the PCU Model are:''
*an ISP like service structure
*proposing the resource profiling scheme for resource registration
*addressing scalability by developing PCU structure made up of domains
*incorporating peering technology for inter-domain information dissemination
*SLA based service instantiation and monitoring

''The key concepts of the VCs idea in this paper are:''
*it mathematically formulates the trade-off between achieving the best QoS and reducing the system cost, making it best suitable for commercial infrastructures
*even though multiple services can occupy a single resource and the service–resource attachments can change with time, a virtualized static logical resource set exposed to the service origin (SO) hides the complexity
*being a semi-dynamic scheme, a VC can reshape itself matching the varying demand pattern, at the same time the static virtualization to the SO simplifying the service management
*the optimization based VC creation results in better resource utilization

''The key concept to anchor points:''
*By providing a representation of demand distribution in a network, the concept of anchor point enables a client-centric resource allocation for wide-area services.

''The key attributes of Overload Partitions:''
*they are selected via an optimization process and they are shared among multiple services.
*Provides a cost effective, but still QoS obeying solution to handle demand spikes in the network
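
To make the two-phase idea concrete for our own discussion, here is a minimal Python sketch. It is not the thesis's heuristic (the abstract does not spell it out): the off-line phase is a plain greedy stand-in that picks the cheapest capacity first, and the on-line phase sends each job to the node with the lowest relative load. The names <code>create_vc</code> and <code>schedule</code>, and the (name, capacity, cost) resource tuples, are our own assumptions.

```python
import heapq


def create_vc(resources, demand):
    """Off-line phase (greedy stand-in, not the thesis's optimizer).

    resources: list of (name, capacity, cost) tuples.
    Picks nodes in order of cost per unit capacity until the
    total capacity covers the expected demand.
    """
    chosen, total = [], 0
    for name, capacity, cost in sorted(resources, key=lambda r: r[2] / r[1]):
        if total >= demand:
            break
        chosen.append((name, capacity))
        total += capacity
    if total < demand:
        raise ValueError("not enough capacity to cover the demand")
    return chosen


def schedule(vc, jobs):
    """On-line phase: place each job on the node with the lowest relative load.

    vc: list of (name, capacity) as returned by create_vc.
    jobs: list of (job_id, size) in arrival order.
    Returns a dict mapping job_id to the chosen node name.
    """
    load = {name: 0 for name, _ in vc}
    heap = [(0.0, name, capacity) for name, capacity in vc]
    heapq.heapify(heap)
    placement = {}
    for job_id, size in jobs:
        _, name, capacity = heapq.heappop(heap)
        load[name] += size
        placement[job_id] = name
        heapq.heappush(heap, (load[name] / capacity, name, capacity))
    return placement
```

For example, <code>create_vc([("a", 10, 5.0), ("b", 10, 20.0), ("c", 20, 8.0)], demand=25)</code> selects nodes c and a, and the scheduler then spreads incoming jobs across those two nodes in proportion to their capacity.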

==== Service Level Agreement in Cloud Computing ====
[http://knoesis.wright.edu/library/download/OOPSLA_cloud_wsla_v3.pdf SLAs in Cloud Computing] A paper written by Pankesh Patel, Ajith Ranabahu, Amit Sheth.

===== Abstract =====
Cloud computing that provides cheap and pay-as-you-go computing resources is rapidly gaining momentum as an alternative to traditional IT Infrastructure. As more and more consumers delegate their tasks to cloud providers, Service Level Agreements (SLA) between consumers and providers emerge as a key aspect. Due to the dynamic nature of the cloud, continuous monitoring on Quality of Service (QoS) attributes is necessary to enforce SLAs. Also numerous other factors such as trust (on the cloud provider) come into consideration, particularly for enterprise customers that may outsource its critical data. This complex nature of the cloud landscape warrants a sophisticated means of managing SLAs. This paper proposes a mechanism for managing SLAs in a cloud computing environment using the Web Service Level Agreement (WSLA) framework, developed for SLA monitoring and SLA enforcement in a Service Oriented Architecture (SOA). We use the third party support feature of WSLA to delegate monitoring and enforcement tasks to other entities in order to solve the trust issues. We also present a real world use case to validate our proposal.

===== Summary =====
Claimed by Scott

==== Service Level Agreements on IP Networks ====

By Dinesh C. Verma, IBM T. J. Watson Research Center
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1323286&tag=1

===== Abstract =====
This paper provides an overview of service-level agreements in IP networks. It looks at the typical components of a service-level agreement, and identifies three common approaches that are used to satisfy service level agreements in IP networks. The implications of using the approaches in the context of a network service provider, a hosting service provider, and an enterprise are examined. While most providers currently offer a static insurance approach towards supporting service level agreements, the schemes that can lead to more dynamic approaches are identified.

===== Summary =====

(HS) This paper starts off by talking about different components of a service level agreement. These components include:
1) A description of the nature of service to be provided
2) The expected performance level of the service, specifically its reliability and responsiveness
3) The time-frame for response and problem resolution
4) The process for monitoring and reporting the service level
5) The consequences for the service provider not meeting its obligations
6) Escape clauses and constraints.

Then they give three examples of Service level agreements on IP Networks:
1) Network Connectivity Services
2) Hosting Services
3) Integrated services

For each of the above three, they suggest some availability, performance and reliability clauses. I think the three notions of 'availability, reliability and performance' could be three parameters that the scheme we are designing should have for each contract.

After this they discuss three different approaches to support SLAs
1) Insurance Approach
2) Provisioning Approach
3) Adaptive Approach
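
As a rough illustration of the suggestion above (availability, reliability and performance as per-contract parameters), a contract could carry promised levels and be compared against measured ones. This is only a sketch: the field names, the units (fractions and milliseconds) and the <code>violations</code> helper are our own assumptions, not something taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevels:
    availability: float  # fraction of time the service was reachable (0..1)
    reliability: float   # fraction of requests completed without error (0..1)
    response_ms: float   # performance: response time in milliseconds


def violations(promised: ServiceLevels, measured: ServiceLevels) -> list:
    """Return the names of the parameters where the measured level
    falls short of what the contract promised."""
    broken = []
    if measured.availability < promised.availability:
        broken.append("availability")
    if measured.reliability < promised.reliability:
        broken.append("reliability")
    if measured.response_ms > promised.response_ms:
        broken.append("performance")
    return broken
```

An empty list would mean the contract was met on all three parameters; anything else names the clauses that the provider missed.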

==== Trustworthiness of New Contracts ====

[http://dx.doi.org/10.1007/978-3-642-10203-5_12 Determining the Trustworthiness of New Electronic Contracts] from Lecture Notes in Computer Science by Paul Groth, Simon Miles, Sanjay Modgil, Nir Oren, Michael Luck and Yolanda Gil, 2009.

===== Abstract =====

Expressing contractual agreements electronically potentially allows agents to automatically perform functions surrounding contract use: establishment, fulfilment, renegotiation etc. For such automation to be used for real business concerns, there needs to be a high level of trust in the agent-based system. While there has been much research on simulating trust between agents, there are areas where such trust is harder to establish. In particular, contract proposals may come from parties that an agent has had no prior interaction with and, in competitive business-to-business environments, little reputation information may be available. In human practice, trust in a proposed contract is determined in part from the content of the proposal itself, and the similarity of the content to that of prior contracts, executed to varying degrees of success. In this paper, we argue that such analysis is also appropriate in automated systems, and to provide it we need systems to record salient details of prior contract use and algorithms for assessing proposals on their content. We use provenance technology to provide the former and detail algorithms for measuring contract success and similarity for the latter, applying them to an aerospace case study.

===== Summary =====

Imagine this as a table:

*Type
**Whether this is an obligation or permission. A prohibition is modelled as an obligation not to do something, i.e. with a negative normative condition below
*Target
**The contract party obliged, prohibited or permitted by the clause.
*Activating Condition
**The circumstances under which the clause has force, parameterized by the variables specific to each instance.
*Normative Condition
**The circumstances under which the obligation is not being violated or the permission is being taken advantage of, parameterized by the variables specific to each instance. Therefore, for an obligation, the target must maintain the normative condition so as not to be in violation of the contract.
*Expiration Condition
**The circumstances under which the clause no longer has force, parameterized by the variables specific to each instance.

This paper provides a nice, straightforward definition of what a contract is, and provides the above schema for a contract.
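
To see how the schema might look in code, here is a small Python sketch of a single clause. The field names follow the list above, but the representation of conditions as functions over a state dictionary, the simplification that a clause is in force whenever its activating condition holds and its expiration condition does not, and the example values are all our own assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]
Condition = Callable[[State], bool]


@dataclass
class Clause:
    kind: str              # "obligation" or "permission"
    target: str            # the contract party the clause applies to
    activating: Condition  # when the clause has force
    normative: Condition   # what must hold for an obligation to be respected
    expiration: Condition  # when the clause no longer has force

    def in_force(self, state: State) -> bool:
        # Simplification: in force while activated and not yet expired.
        return self.activating(state) and not self.expiration(state)

    def violated(self, state: State) -> bool:
        # Only obligations can be violated; permissions are merely unused.
        return (
            self.kind == "obligation"
            and self.in_force(state)
            and not self.normative(state)
        )


# Hypothetical example: Company A must keep B's data cached on day 1.
cache_clause = Clause(
    kind="obligation",
    target="Company A",
    activating=lambda s: s["day"] >= 1,
    normative=lambda s: s["cache_available"],
    expiration=lambda s: s["day"] > 1,
)
```

A prohibition would be written the same way, as an obligation whose normative condition is the negation of the forbidden action.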

==== Web Privacy with P3P ====

http://www.oreilly.de/catalog/webprivp3p/

This book talks about P3P and how companies and web developers can comply with P3P.
Also check http://www.w3.org/P3P/

===== Summary =====
This is about information observation, not what we want to do.

==== Gossiping in Distributed Systems ====
[1] Paper: Gossiping in Distributed Systems, Anne-Marie Kermarrec, Maarten van Steen, ACM SIGOPS 2007

===== Abstract =====

Gossip-based algorithms were first introduced for reliably disseminating data in large-scale distributed systems. However, their simplicity, robustness, and flexibility make them attractive for more than just pure data dissemination alone. In particular, gossiping has been applied to data aggregation, overlay maintenance, and resource allocation. Gossiping applications more or less fit the same framework, with often subtle differences in algorithmic details determining divergent emergent behaviour. This divergence is often difficult to understand, as formal methods have yet to be developed that can capture the full design space of gossiping solutions. In this paper, we present a brief introduction to the field of gossiping in distributed systems, by providing a simple framework and using that framework to describe solutions for various application domains.

===== Summary =====
Hadi

This paper talks about different applications of gossiping and different approaches. Among the most important applications are data dissemination and monitoring services (such as failure detection). Nearly all gossip-style algorithms follow this framework:

a) Peer selection: This is the process of selecting a list of peers, either uniformly at random, or based on some ranking criteria (e.g. proximity, need etc.) to send some data to.
b) Data Exchanged: Selecting the information to pass on to the selected peers.
c) Data Processing: Processing the data received.

"Each peer is equipped with a cache, consisting of references to other peers in the system." This cache could also store information about other peers. In our project, this could be the service every other peer provides and the reputation that the peers have.

The paper is then divided into a few categories:
1) Dissemination
2) Peer Sampling
3) Topology Construction
4) Resource Management

For each category, they adapt the framework described above.

1) Data Dissemination: "Traditionally, gossip-based solutions have been used for data dissemination purposes. A standard approach toward dissemination is to simply let peers forward messages to each other [1]". The framework for this section is as follows:

"Peer Selection: Each peer P periodically chooses f >= 1 peers Q1, ... , Qf uniformly at random from the entire set of currently available peers.

Data Exchanged: A message is selected from the local cache and copied from one peer to another. In a push model, P forwards messages to each Qi; in a pull model, each Qi sends a message to P. A combination of the two is also possible.

Data Processing: Effectively, nothing special is done except storing the received message for a next iteration, or passing to a higher (application) layer. [1]"
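
The three steps of the dissemination framework can be sketched as a small push-model simulation in Python. This is an illustration under our own assumptions, not code from the paper: rounds are synchronous, every peer knows every other peer, each peer pushes its whole cache rather than one selected message, and the names <code>push_gossip_round</code> and <code>rounds_until_all_informed</code> are made up for this sketch.

```python
import random


def push_gossip_round(caches, fanout, rng):
    """One synchronous push round over all peers.

    caches: dict mapping peer id to the set of messages it holds.
    """
    peers = list(caches)
    incoming = {p: set() for p in peers}
    for p in peers:
        others = [q for q in peers if q != p]
        # Peer selection: f peers chosen uniformly at random.
        for q in rng.sample(others, min(fanout, len(others))):
            # Data exchanged: P pushes what it currently holds to Q.
            incoming[q] |= caches[p]
    for p in peers:
        # Data processing: store what was received for the next round.
        caches[p] |= incoming[p]


def rounds_until_all_informed(n, fanout, seed=0, max_rounds=1000):
    """Count the rounds until a message starting at peer 0 reaches all n peers.

    Returns None if that does not happen within max_rounds.
    """
    rng = random.Random(seed)
    caches = {p: set() for p in range(n)}
    caches[0].add("msg")
    for round_number in range(1, max_rounds + 1):
        push_gossip_round(caches, fanout, rng)
        if all("msg" in cache for cache in caches.values()):
            return round_number
    return None
```

For our project, the same cache could hold (service, reputation) entries about other peers instead of a plain message, so that reputation information spreads through the same mechanism.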

2) Peer Sampling:

==Increasing Observability==

As we discussed on Thursday, the real question when looking at observability is whether an action can be viewed, and who can view it. In the real world, you have a chance of being observed no matter what you do; the Internet, on the other hand, reduces this observability and instead offers a modicum of anonymity.

As the possibility of being observed increases, behavior adjusts to encourage the positive reputation of the actor or to conform with laws and regulations. This is the main benefit we wish to obtain by increasing the observability of digital actions. While omnipresent observation is possible on a computer network, in terms of observing contracts it might be more efficient to impose only the possibility of being observed, rather than observing every action.

===A Possible System for Increasing Observability of Contracts and Actions?===

In class on Thursday, Scott brought up the idea of tracking a contract by making a minimal set of details available to all (i.e., everyone knows the parties involved in the contract, and whether the contract was fulfilled). Taking this a little further, our group considered the existence of an anonymous, distributed quorum of observers.

This quorum would, upon the creation of a contract, be given a summary of the contract (for example, Company A has agreed to cache data for Company B on a given day, while Company B will reciprocate the following day). Over the term of the contract, the individual systems in the quorum would test the contract to see if the terms had been met. At the end of the contract period, the systems would provide a "vote" declaring whether they witnessed the contract being fulfilled.
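
A minimal Python sketch of how the end-of-contract vote might be tallied. The simple-majority rule, the treatment of observers who witnessed nothing as abstentions, and the name <code>tally</code> are all assumptions made for illustration; we have not settled on any of them.

```python
def tally(votes, threshold=0.5):
    """Decide the outcome of a contract from the observers' votes.

    votes: dict mapping an observer id to True (saw the terms met),
    False (saw the terms broken) or None (did not observe anything).
    Returns "fulfilled", "not fulfilled" or "undetermined".
    """
    cast = [v for v in votes.values() if v is not None]
    if not cast:
        return "undetermined"
    share_yes = sum(cast) / len(cast)
    # A strict majority is required, so a tie counts against fulfilment.
    return "fulfilled" if share_yes > threshold else "not fulfilled"
```

Raising <code>threshold</code> would make the quorum stricter, at the cost of more contracts being declared unfulfilled because of a few dissenting or faulty observers.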

This system could also be extended to monitor general actions. Consider again this set of observers; now, however, they connect at random to various websites and take a snapshot of all connections to each one. At any given time, no other user knows which system the observers will be monitoring. In other words, the observers are analogous to police patrols, albeit with no set patrol route.
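
To get a feel for how much "possibility of being observed" such patrols provide, here is a small Python sketch. It assumes that each observer independently picks one site uniformly at random per time step, which is our own simplification; the function names are hypothetical.

```python
import random


def patrol(sites, num_observers, rng):
    """Return the set of sites watched in one time step, with each
    observer choosing a site uniformly at random (assumed policy)."""
    return {rng.choice(sites) for _ in range(num_observers)}


def p_site_observed(num_sites, num_observers):
    """Probability that a given site is watched by at least one observer
    in a single time step, under the independence assumption above."""
    if num_sites < 1 or num_observers < 0:
        raise ValueError("need at least one site and no negative observers")
    return 1.0 - (1.0 - 1.0 / num_sites) ** num_observers
```

Under these assumptions the chance of being watched grows with the number of observers but never requires watching every site at once, which is the trade-off the patrol analogy is meant to capture.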

Category talk:2011-O&C

2011-04-04T17:05:52Z

Hadi sajjadpour:

==Papers==

===Observability===

* How do we define 'public' action? How do we monitor 'public' action without monitoring every action?
* How can you make sure your agent is acting according to your instructions?
* How can we ensure that information we receive through a third-party is legitimate?
* What '''CAN''' be observed?

==== Contract Monitoring ====

[http://dx.doi.org/10.1007/978-3-642-03668-2_29 Contract Monitoring in Agent-Based Systems: Case Study] from Lecture Notes in Computer Science by Jiří Hodík, Jiří Vokřínek and Michal Jakob, 2009

===== Abstract =====

Monitoring of fulfilment of obligations defined by electronic contracts in distributed domains is presented in this paper. A two-level model of contract-based systems and the types of observations needed for contract monitoring are introduced. The observations (inter-agent communication and agents’ actions) are collected and processed by the contract observation and analysis pipeline. The presented approach has been utilized in a multi-agent system for electronic contracting in a modular certification testing domain.

===== Summary =====
Andrew

==== Monitoring Service Contracts ====

[http://dx.doi.org/10.1007/3-540-45705-4_15 An Agent-Based Framework for Monitoring Service Contracts] from Lecture Notes in Computer Science by Helmut Kneer, Henrik Stormer, Harald Häuschen and Burkhard Stiller, 2002

===== Abstract =====

Within the past few years, the variety of real-time multimedia streaming services on the Internet has grown steadily. Performance of streaming services is very sensitive to traffic congestion and results very often in poor service quality on today’s best effort Internet. Reasons include the lack of any traffic prioritization mechanisms on the network level and its dependence on the cooperation of several Internet Service Providers and their reliable transmission of data packets. Therefore, service differentiation and its reliable delivery must be enforced on a business level through the introduction of service contracts between service providers and their customers. However, compliance with such service contracts is the crucial point that decides about successful improvement of the service delivery process. For that reason, an agent-based monitoring framework has been developed and introduced enabling the use of mobile agents to monitor compliance with contractual agreements between service providers and service customers. This framework describes the setup and the functionality of different kinds of mobile agents that allow monitoring of service contracts across domains of multiple service providers.

===== Summary =====
Andrew

===Contracts===

* What can or can't be contracted?
* How can you quantify abstract resources?
* How can two or more parties agree with a minimum of intervention?

Some forms of contracts exist in the form of Service Level Agreements, and there have been efforts made to automate this process:

==== AURIC ====
[http://dx.doi.org/10.1007/978-3-540-75694-1_21 AURIC: A Scalable and Highly Reusable SLA Compliance Auditing Framework] from Lecture Notes in Computer Science, by Hasan and Burkhard Stiller, 2007.

===== Abstract =====
Service Level Agreements (SLA) are needed to allow business interactions to rely on Internet services. Service Level Objectives (SLO) specify the committed performance level of a service. Thus, SLA compliance auditing aims at verifying these commitments. Since SLOs for various application services and end-to-end performance definitions vary largely, automated auditing of SLA compliances poses the challenge to an auditing framework. Moreover, end-to-end performance data are potentially large for a provider with many customers. Therefore, this paper presents a scalable and highly reusable auditing framework and a prototype, termed AURIC (Auditing Framework for Internet Services), whose components can be distributed across different domains.

===== Summary =====
TJ

==== Bandwidth ====
[http://dx.doi.org/10.1007/978-3-540-30189-9_19 SLA-Driven Flexible Bandwidth Reservation Negotiation Schemes for QoS Aware IP Networks] from Lecture Notes in Computer Science by Gerard Parr and Alan Marshall, 2004.

===== Abstract =====
We present a generic Service Level Agreement (SLA)-driven service provisioning architecture, which enables dynamic and flexible bandwidth reservation schemes on a per-user or a per-application basis. Various session level SLA negotiation schemes involving bandwidth allocation, service start time and service duration parameters are introduced and analysed. The results show that these negotiation schemes can be utilised for the benefits of both end user and network provider such as getting the highest individual SLA optimisation in terms of Quality of Service (QoS) and price. A prototype based on an industrial agent platform has also been built to demonstrate the negotiation scenario and this is presented and discussed.

===== Summary =====
Claimed by Scott

==== Dynamic Adaptation ====
[http://dx.doi.org/10.1007/978-3-540-89652-4_28 Context-Driven Autonomic Adaptation of SLA] from Lecture notes in Computer Science, by authors Caroline Herssens, Stéphane Faulkner and Ivan Jureta, 2008.

===== Abstract =====
Service Level Agreements (SLAs) are used in Service-Oriented Computing to define the obligations of the parties involved in a transaction. SLAs define the service users’ Quality of Service (QoS) requirements that the service provider should satisfy. Requirements defined once may not be satisfiable when the context of the web services changes (e.g., when requirements or resource availability changes). Changes in the context can make SLAs obsolete, making SLA revision necessary. We propose a method to autonomously monitor the services’ context, and adapt SLAs to avoid obsolescence thereof.

===== Summary =====
TJ

==== Heuristics for Enforcing Service Level Agreements ====
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.8674&rep=rep1&type=pdf Heuristics for Enforcing Service Level Agreements in a Public Computing Utility] A masters thesis paper by Balasubramaneyam Maniymaran.

===== Abstract =====
With the increasing popularity of consumer and research oriented wide-area applications, there arises a need for a robust and efficient wide-area resource management system. Even though there exists number of systems for wide area resource management, they fail to couple the QoS management with cost management, which is the key issue in pushing such a system to be commercially successful. Further, the lack of IT skills within the companies arouses the need of decoupling service management from the underlying complex wide-area resource management. A public computing utility (PCU) addresses both these issues, and, in addition, it creates a market place for the selling idling computing resources.

This work proposes a PCU model addressing the above mentioned issues and develops heuristics to enforce QoS in that model. A new concept called virtual clusters (VCs) is introduced as semi-dynamic, service specific resource partitions of a PCU, optimizing cost, QoS, and resource utilization. This thesis describes the methodology of VC creation, analyses the formulation of a VC creation into an optimization problem, and develops solution heuristics. The concept of VC is supported by two other concepts introduced here namely anchor point (AP) and overload partition (OLP). The concept of AP is used to represent the demand distribution in a network that assists the problem formulation of the VC creation and SLA management. The concept of overload partition is used to handle the demand spikes in a VC.

In a PCU, the VC management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the APs among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

===== Summary =====
One key concept that we should take from this paper is the way they decided how to allocate the resources. Here is a brief but excellent point to consider:
*In a public computing utility (PCU), the virtual cluster (VC) management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the anchor points (AP) among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

'''''KEY CONCEPTS'''''

''The key features of the PCU Model are:''
*an ISP like service structure
*proposing the resource profiling scheme for resource registration
*addressing scalability by developing PCU structure made up of domains
*incorporating peering technology for inter-domain information dissemination
*SLA based service instantiation and monitoring

''The key concepts of the VCs idea in this paper are:''
*it mathematically formulates the trade-off between achieving the best QoS and reducing the system cost, making it best suitable for commercial infrastructures
*even though multiple services can occupy a single resource and the service–resource attachments can change with time, a virtualized static logical resource set exposed to the service origin (SO) hides the complexity
*being a semi-dynamic scheme, a VC can reshape itself matching the varying demand pattern, at the same time the static virtualization to the SO simplifying the service management
*the optimization based VC creation results in better resource utilization

''The key concept to anchor points:''
*By providing a representation of demand distribution in a network, the concept of anchor point enables a client-centric resource allocation for wide-area services.

''The key attributes of Overload Partitions:''
*they are selected via an optimization process and they are shared among multiple services.
*Provides a cost effective, but still QoS obeying solution to handle demand spikes in the network

==== Service Level Agreement in Cloud Computing ====
[http://knoesis.wright.edu/library/download/OOPSLA_cloud_wsla_v3.pdf SLAs in Cloud Computing] A paper written by Pankesh Patel, Ajith Ranabahu, Amit Sheth.

===== Abstract =====
Cloud computing that provides cheap and pay-as-you-go computing resources is rapidly gaining momentum as an alternative to traditional IT Infrastructure. As more and more consumers delegate their tasks to cloud providers, Service Level Agreements (SLA) between consumers and providers emerge as a key aspect. Due to the dynamic nature of the cloud, continuous monitoring on Quality of Service (QoS) attributes is necessary to enforce SLAs. Also numerous other factors such as trust (on the cloud provider) come into consideration, particularly for enterprise customers that may outsource its critical data. This complex nature of the cloud landscape warrants a sophisticated means of managing SLAs. This paper proposes a mechanism for managing SLAs in a cloud computing environment using the Web Service Level Agreement (WSLA) framework, developed for SLA monitoring and SLA enforcement in a Service Oriented Architecture (SOA). We use the third party support feature of WSLA to delegate monitoring and enforcement tasks to other entities in order to solve the trust issues. We also present a real world use case to validate our proposal.

===== Summary =====
Claimed by Scott

==== Service Level Agreements on IP Networks ====

By Dinesh C. Verma, IBM T. J. Watson Research Center
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1323286&tag=1

===== Abstract =====
This paper provides an overview of service-level agreements in IP networks. It looks at the typical components of a service-level agreement, and identifies three common approaches that are used to satisfy service level agreements in IP networks. The implications of using the approaches in the context of a network service provider, a hosting service provider, and an enterprise are examined. While most providers currently offer a static insurance approach towards supporting service level agreements, the schemes that can lead to more dynamic approaches are identified.

===== Summary =====

(HS) This paper starts off by talking about different components of a service level agreement. These components include:
1) A description of the nature of service to be provided
2) The expected performance level of the service, specifically its reliability and responsiveness
3) The time-frame for response and problem resolution
4) The process for monitoring and reporting the service level
5) The consequences for the service provider not meeting its obligations
6) Escape clauses and constraints.

Then they give three examples of Service level agreements on IP Networks:
1) Network Connectivity Services
2) Hosting Services
3) Integrated services

For each of the above three, they suggest some availability, performance and reliability clauses. I think the three notions of 'availability, reliability and performance' could be three parameters that the scheme we are designing should have for each contract.

After this they discuss three different approaches to support SLAs
1) Insurance Approach
2) Provisioning Approach
3) Adaptive Approach

==== Trustworthiness of New Contracts ====

[http://dx.doi.org/10.1007/978-3-642-10203-5_12 Determining the Trustworthiness of New Electronic Contracts] from Lecture Notes in Computer Science by Paul Groth, Simon Miles, Sanjay Modgil, Nir Oren, Michael Luck and Yolanda Gil, 2009.

===== Abstract =====

Expressing contractual agreements electronically potentially allows agents to automatically perform functions surrounding contract use: establishment, fulfilment, renegotiation etc. For such automation to be used for real business concerns, there needs to be a high level of trust in the agent-based system. While there has been much research on simulating trust between agents, there are areas where such trust is harder to establish. In particular, contract proposals may come from parties that an agent has had no prior interaction with and, in competitive business-to-business environments, little reputation information may be available. In human practice, trust in a proposed contract is determined in part from the content of the proposal itself, and the similarity of the content to that of prior contracts, executed to varying degrees of success. In this paper, we argue that such analysis is also appropriate in automated systems, and to provide it we need systems to record salient details of prior contract use and algorithms for assessing proposals on their content. We use provenance technology to provide the former and detail algorithms for measuring contract success and similarity for the latter, applying them to an aerospace case study.

===== Summary =====

The contract clause schema, as a table:

{| class="wikitable"
! Field !! Description
|-
| Type || Whether this is an obligation or permission. A prohibition is modelled as an obligation not to do something, i.e. with a negative normative condition (see below).
|-
| Target || The contract party obliged, prohibited or permitted by the clause.
|-
| Activating Condition || The circumstances under which the clause has force, parameterized by the variables specific to each instance.
|-
| Normative Condition || The circumstances under which the obligation is not being violated or the permission is being taken advantage of, parameterized by the variables specific to each instance. Therefore, for an obligation, the target must maintain the normative condition so as not to be in violation of the contract.
|-
| Expiration Condition || The circumstances under which the clause no longer has force, parameterized by the variables specific to each instance.
|}

This paper provides a nice, straightforward definition of what a contract is, along with the above schema for a contract.
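
The schema above could be carried in code roughly as follows. This is only a sketch under our own assumptions: the names <code>Clause</code>, <code>ClauseKind</code>, <code>in_force</code> and <code>in_violation</code>, the choice to model each condition as a predicate over a dictionary of instance variables, and the rule that a clause is "in force" once activated and not yet expired are ours and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict

# Assumption: a condition is a predicate over the variables of one clause instance.
Condition = Callable[[Dict[str, Any]], bool]


class ClauseKind(Enum):
    OBLIGATION = "obligation"
    PERMISSION = "permission"


@dataclass(frozen=True)
class Clause:
    kind: ClauseKind        # a prohibition is an obligation with a negated normative condition
    target: str             # the contract party the clause applies to
    activating: Condition   # when the clause has force
    normative: Condition    # what must hold (obligation) or may be used (permission)
    expiration: Condition   # when the clause no longer has force

    def in_force(self, state: Dict[str, Any]) -> bool:
        # Assumption: in force means activated and not yet expired.
        return self.activating(state) and not self.expiration(state)

    def in_violation(self, state: Dict[str, Any]) -> bool:
        # Only an obligation that is in force can be violated.
        return (
            self.kind is ClauseKind.OBLIGATION
            and self.in_force(state)
            and not self.normative(state)
        )


# Hypothetical example: Company A must keep B's data cached on day 1.
cache_clause = Clause(
    kind=ClauseKind.OBLIGATION,
    target="Company A",
    activating=lambda s: s["day"] >= 1,
    normative=lambda s: s["data_cached"],
    expiration=lambda s: s["day"] > 1,
)
```

Keeping the three conditions as separate predicates mirrors the table row for row, so an observer only needs the current state to decide whether the target is in violation.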

==== Web Privacy with P3P ====

http://www.oreilly.de/catalog/webprivp3p/

This book discusses P3P and how companies and web developers can comply with it.
See also http://www.w3.org/P3P/

===== Summary =====
This is about information observation, not what we want to do.

==Increasing Observability==

As we discussed on Thursday, the real question when looking at observability is whether an action can be viewed, and who can view it. In the real world, you have a chance of being observed no matter what you do; the Internet, on the other hand, reduces this observability and instead offers a modicum of anonymity.

As the possibility of being observed increases, behavior adjusts to encourage the positive reputation of the actor or to conform with laws and regulations. This is the main benefit we wish to obtain by increasing the observability of digital actions. While omnipresent observation is possible on a computer network, for observing contracts it might be more efficient to impose only the possibility of being observed.

==== Gossiping in Distributed Systems ====
[1] Anne-Marie Kermarrec and Maarten van Steen, "Gossiping in Distributed Systems", ACM SIGOPS Operating Systems Review, 2007.

===== Abstract =====

Gossip-based algorithms were first introduced for reliably disseminating data in large-scale distributed systems. However, their simplicity, robustness, and flexibility make them attractive for more than just pure data dissemination alone. In particular, gossiping has been applied to data aggregation, overlay maintenance, and resource allocation. Gossiping applications more or less fit the same framework, with often subtle differences in algorithmic details determining divergent emergent behaviour. This divergence is often difficult to understand, as formal methods have yet to be developed that can capture the full design space of gossiping solutions. In this paper, we present a brief introduction to the field of gossiping in distributed systems, by providing a simple framework and using that framework to describe solutions for various application domains.

===== Summary =====
Hadi

This paper discusses different applications of gossiping and different approaches to it. Among the most important applications are data dissemination and monitoring services (such as failure detection). Nearly all gossip-style algorithms fit the following framework:

a) Peer Selection: Selecting a list of peers to send data to, either uniformly at random or based on some ranking criterion (e.g. proximity, need, etc.).
b) Data Exchanged: Selecting the information to pass on to the selected peers.
c) Data Processing: Processing the data received.

"Each peer is equipped with a cache, consisting of references to other peers in the system." This cache could also store information about other peers. In our project, this could be the service every other peer provides and the reputation that the peers have.

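As a rough sketch of such a cache for our project, each entry could hold a peer reference together with the service that peer offers and its reputation. The names <code>PeerEntry</code> and <code>PeerCache</code>, the bounded capacity, the eviction policy and the 0-to-1 reputation scale are all our assumptions; none of them is specified in [1].

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PeerEntry:
    peer_id: str
    service: str       # service the peer advertises (hypothetical field)
    reputation: float  # assumed scale: 0.0 (worst) to 1.0 (best)


class PeerCache:
    """Bounded cache of references to other peers."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: Dict[str, PeerEntry] = {}

    def update(self, entry: PeerEntry) -> None:
        # Newer information about a known peer overwrites the old entry.
        self.entries[entry.peer_id] = entry
        if len(self.entries) > self.capacity:
            # Evict the lowest-reputation peer (one possible policy).
            worst = min(self.entries.values(), key=lambda e: e.reputation)
            del self.entries[worst.peer_id]

    def select_random(self, k: int, rng: random.Random) -> List[PeerEntry]:
        # Peer selection, uniformly at random.
        pool = list(self.entries.values())
        return rng.sample(pool, min(k, len(pool)))

    def select_by_reputation(self, k: int) -> List[PeerEntry]:
        # Peer selection by ranking: best reputation first.
        ranked = sorted(self.entries.values(), key=lambda e: e.reputation, reverse=True)
        return ranked[:k]
```

The two selection methods correspond to the two options in step a) of the framework: uniform random choice, or choice by a ranking criterion (here, reputation).
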
The paper is then divided into a few categories:
1) Dissemination
2) Peer Sampling
3) Topology Construction
4) Resource Management

In each, they adapt the framework mentioned above.

1) Data Dissemination: "Traditionally, gossip-based solutions have been used for data dissemination purposes. A standard approach toward dissemination is to simply let peers forward messages to each other" [1]. The framework for this section, quoted from [1], is as follows:

Peer Selection: Each peer P periodically chooses f >= 1 peers Q1, ... , Qf uniformly at random from the entire set of currently available peers.

Data Exchanged: A message is selected from the local cache and copied from one peer to another. In a push model, P forwards messages to each Qi; in a pull model, each Qi sends a message to P. A combination of the two is also possible.

Data Processing: Effectively, nothing special is done except storing the received message for a next iteration, or passing to a higher (application) layer.
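
The dissemination framework above can be tried out with a small simulation. This is a sketch of the push model only, under simplifying assumptions of ours: a single message, synchronous rounds, no failures, and only peers that already hold the message forward it. The function name <code>push_gossip_rounds</code> and its parameters are hypothetical.

```python
import random
from typing import List, Set


def push_gossip_rounds(n_peers: int, fanout: int, seed: int = 0) -> int:
    """Simulate push-style dissemination of one message.

    Returns the number of rounds until every peer has stored the message.
    """
    if n_peers < 2 or fanout < 1:
        raise ValueError("need at least 2 peers and fanout f >= 1")
    rng = random.Random(seed)
    informed: Set[int] = {0}  # peer 0 starts with the message in its cache
    rounds = 0
    while len(informed) < n_peers:
        rounds += 1
        newly_informed: Set[int] = set()
        for p in informed:
            # Peer selection: f peers chosen uniformly at random (excluding P).
            others: List[int] = [q for q in range(n_peers) if q != p]
            targets = rng.sample(others, min(fanout, len(others)))
            # Data exchanged (push): P forwards the message to each Qi.
            newly_informed.update(targets)
        # Data processing: receivers store the message for the next iteration.
        informed |= newly_informed
    return rounds
```

For example, <code>push_gossip_rounds(50, 2)</code> reports how many rounds a fanout of 2 needs to reach 50 peers for a given seed; since the informed set can at most triple per round in that case, it can never be fewer than 4.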

2) Peer Sampling:

===A Possible System for Increasing Observability of Contracts and Actions?===

In class on Thursday, Scott brought up the idea of tracking a contract by making a minimal set of details available to all (i.e., everyone knows the parties involved in the contract, and whether the contract was fulfilled). Taking this a little further, our group considered the existence of an anonymous, distributed quorum of observers.

This quorum would, upon the creation of a contract, be given a summary of the contract (for example, Company A has agreed to cache data for Company B on a given day, while Company B will reciprocate the following day). Over the term of the contract, the individual systems in the quorum would test the contract to see if the terms had been met. At the end of the contract period, the systems would provide a "vote" declaring whether they witnessed the contract being fulfilled.
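
To make the voting step concrete, here is a minimal sketch of how the votes of such a quorum might be tallied. The text above does not fix a decision rule, so the simple-majority rule, the minimum-turnout parameter and all names (<code>Vote</code>, <code>tally</code>) are assumptions for illustration only.

```python
from enum import Enum
from typing import Iterable


class Vote(Enum):
    FULFILLED = "fulfilled"
    NOT_FULFILLED = "not fulfilled"
    ABSTAIN = "abstain"  # observer could not test the contract


def tally(votes: Iterable[Vote], min_turnout: int = 3) -> str:
    """Decide a contract outcome from observer votes by simple majority."""
    votes = list(votes)
    yes = sum(v is Vote.FULFILLED for v in votes)
    no = sum(v is Vote.NOT_FULFILLED for v in votes)
    if yes + no < min_turnout:
        return "inconclusive"  # too few observers actually witnessed anything
    if yes > no:
        return "fulfilled"
    if no > yes:
        return "not fulfilled"
    return "inconclusive"  # tie
```

Allowing an explicit abstention keeps observers that never managed to test the contract from being counted as witnesses on either side.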

This system could also be extended to monitor general actions. Consider again this set of observers; now, however, they connect at random to various websites and take a snapshot of all connections to each one. At any given time, no other user knows which system the observers will be monitoring. In other words, the observers are analogous to police patrols, albeit with no set patrol route.

Category talk:2011-O&C

2011-03-29T16:23:07Z

Hadi sajjadpour: /* Summary */

Category:2011-Contracts

2011-03-17T17:19:20Z

Hadi sajjadpour:

This is the category page for things regarding contracts

* (TK) When I think of contracts, I think of:
** (TK) who is responsible for the terms and agreements
** (TK) who is made aware of the agreement? Is it broadcast to everyone or only a select few people?
** (TK) who/what is the governing body?
*** (SL) Does there even need to be a governing body?
** (TK) the first step should be to find a way to verify the different parties involved in the contract.
* (HS) What mechanism can be provided to enforce contracts?
* (SL) Should there be more of a feedback system than COMPLETE/INCOMPLETE?
* (SL) Does every contract have to be about the exchange of quantifiable "goods"?

[[Category:2011-O&C]]

Category:2011-Observability

2011-03-17T16:16:13Z

Hadi sajjadpour:

* (TK) How can we observe that the information we (our computers or ourselves) provide to the "network" or the public is not going to be used maliciously?
** (TK) One point that was discussed in class was the idea of a digital fingerprint. Is this really feasible, and how would it work?
** (HS) One way is to minimize what information you give away, or at least 'be aware' of what you are giving away. However, observability is not limited to just information that we send out. We are also looking for ways to observe contracts being fulfilled.
** (HS) Also, fingerprinting aside, there is no guarantee about what happens once your information reaches the intended party: whatever fingerprint mechanism or security measure you use, once they have decrypted it you cannot (assuming they are not caught) stop them from passing the information on.
[[Category:2011-O&C]]

Category:2011-Observability

2011-03-17T16:16:02Z

Hadi sajjadpour:

* (TK) How can we observe that the information we (our computers or ourselves) provide to the "network" or the public is not going to be used maliciously?
** (TK) One point that was discussed in class was the idea of a digital fingerprint. Is this really feasible, and how would it work?
** (HS) One way is to minimize what information you give away, or at least 'be aware' of what you are giving away. However, observability is not limited to just information that we send out. We are also looking for ways to observe contracts being fulfilled.
** (HS) Also, fingerprinting aside, there is no guarantee about what happens once your information reaches the intended party: whatever fingerprint mechanism or security measure you use, once they have decrypted it you cannot (assuming they are not caught) stop them from passing the information on.
[[Category:2011-O&C]]

Category:2011-Contracts

2011-03-17T16:10:44Z

Hadi sajjadpour:

Category:2011-Observability

2011-03-17T16:08:46Z

Hadi sajjadpour:

Category talk:2011-O&C

2011-03-16T22:14:10Z

Hadi sajjadpour:

Category:2011-Contracts

2011-03-15T18:06:14Z

Hadi sajjadpour:

This is the category page for things regarding contracts

[[Category:2011-O&C]]

Category:2011-Contracts

2011-03-15T18:05:08Z

Hadi sajjadpour:

This is the category page for things regarding contracts

* Hello
** oh!
[[Category:2011-O&C]]

Talk:DistOS-2011W Observability & Contracts

2011-03-15T18:04:09Z

Hadi sajjadpour: /* Test */ new section

Talk:DistOS-2011W Observability & Contracts

2011-03-15T17:36:46Z

Hadi sajjadpour: /* Summary */

Talk:DistOS-2011W Observability & Contracts

2011-03-15T17:36:26Z

Hadi sajjadpour:

Talk:DistOS-2011W Observability & Contracts

2011-03-15T17:33:27Z

Hadi sajjadpour:

Talk:DistOS-2011W Observability & Contracts

2011-03-15T17:32:36Z

Hadi sajjadpour:

DistOS-2011W Observability & Contracts

2011-03-12T19:46:16Z

Hadi sajjadpour: /* Problem Outline */

Please note that the majority of our efforts are contained on the "Discussion" page.

==Problem Outline==

* How do we define 'public' action? How do we monitor 'public' action without monitoring every action?
* How can you make sure your agent is acting according to your instructions?
* How can we ensure that information we receive through a third-party is legitimate?
* How do I observe the acts of other agents, particularly public acts?
* What '''CAN''' be observed?
* How can contracts be made between computers/agents?
* How can we ensure that contracts are being upheld?
* What side effects does observation have? For example, if everyone can see who buys something online, would that encourage or discourage use of such a website?

==Members==
* Seyyed Hadi Sajjadpour
* Tarjit Komal
* Scott Lyons
* Andrew Luczak

Talk:DistOS-2011W Observability & Contracts

2011-03-10T14:19:32Z

Hadi sajjadpour:

==Observability==

* How do we define 'public' action? How do we monitor 'public' action without monitoring every action?
* How can you make sure your agent is acting according to your instructions?
* How can we ensure that information we receive through a third-party is legitimate?

== Contract Monitoring ==

[http://dx.doi.org/10.1007/978-3-642-03668-2_29 Contract Monitoring in Agent-Based Systems: Case Study] from Lecture Notes in Computer Science by Jiří Hodík, Jiří Vokřínek and Michal Jakob, 2009

=== Abstract ===

Monitoring of fulfilment of obligations defined by electronic contracts in distributed domains is presented in this paper. A two-level model of contract-based systems and the types of observations needed for contract monitoring are introduced. The observations (inter-agent communication and agents’ actions) are collected and processed by the contract observation and analysis pipeline. The presented approach has been utilized in a multi-agent system for electronic contracting in a modular certification testing domain.

== Monitoring Service Contracts ==

[http://dx.doi.org/10.1007/3-540-45705-4_15 An Agent-Based Framework for Monitoring Service Contracts] from Lecture Notes in Computer Science by Helmut Kneer, Henrik Stormer, Harald Häuschen and Burkhard Stiller, 2002

=== Abstract ===

Within the past few years, the variety of real-time multimedia streaming services on the Internet has grown steadily. Performance of streaming services is very sensitive to traffic congestion and results very often in poor service quality on today’s best effort Internet. Reasons include the lack of any traffic prioritization mechanisms on the network level and its dependence on the cooperation of several Internet Service Providers and their reliable transmission of data packets. Therefore, service differentiation and its reliable delivery must be enforced on a business level through the introduction of service contracts between service providers and their customers. However, compliance with such service contracts is the crucial point that decides about successful improvement of the service delivery process. For that reason, an agent-based monitoring framework has been developed and introduced enabling the use of mobile agents to monitor compliance with contractual agreements between service providers and service customers. This framework describes the setup and the functionality of different kinds of mobile agents that allow monitoring of service contracts across domains of multiple service providers.

==Contracts==

* What can or can't be contracted?
* How can you quantify abstract resources?
* How can two or more parties agree with a minimum of intervention?

Some contracts already exist in the form of Service Level Agreements, and efforts have been made to automate them:

== AURIC ==
[http://dx.doi.org/10.1007/978-3-540-75694-1_21 AURIC: A Scalable and Highly Reusable SLA Compliance Auditing Framework] from Lecture Notes in Computer Science, by Hasan and Burkhard Stiller, 2007.

=== Abstract ===
Service Level Agreements (SLA) are needed to allow business interactions to rely on Internet services. Service Level Objectives (SLO) specify the committed performance level of a service. Thus, SLA compliance auditing aims at verifying these commitments. Since SLOs for various application services and end-to-end performance definitions vary largely, automated auditing of SLA compliances poses the challenge to an auditing framework. Moreover, end-to-end performance data are potentially large for a provider with many customers. Therefore, this paper presents a scalable and highly reusable auditing framework and a prototype, termed AURIC (Auditing Framework for Internet Services), whose components can be distributed across different domains.

== Bandwidth ==
[http://dx.doi.org/10.1007/978-3-540-30189-9_19 SLA-Driven Flexible Bandwidth Reservation Negotiation Schemes for QoS Aware IP Networks] from Lecture Notes in Computer Science by Gerard Parr and Alan Marshall, 2004.

=== Abstract ===
We present a generic Service Level Agreement (SLA)-driven service provisioning architecture, which enables dynamic and flexible bandwidth reservation schemes on a per-user or a per-application basis. Various session level SLA negotiation schemes involving bandwidth allocation, service start time and service duration parameters are introduced and analysed. The results show that these negotiation schemes can be utilised for the benefits of both end user and network provider such as getting the highest individual SLA optimisation in terms of Quality of Service (QoS) and price. A prototype based on an industrial agent platform has also been built to demonstrate the negotiation scenario and this is presented and discussed.

== Dynamic Adaptation ==
[http://dx.doi.org/10.1007/978-3-540-89652-4_28 Context-Driven Autonomic Adaptation of SLA] from Lecture Notes in Computer Science by Caroline Herssens, Stéphane Faulkner and Ivan Jureta, 2008.

=== Abstract ===
Service Level Agreements (SLAs) are used in Service-Oriented Computing to define the obligations of the parties involved in a transaction. SLAs define the service users’ Quality of Service (QoS) requirements that the service provider should satisfy. Requirements defined once may not be satisfiable when the context of the web services changes (e.g., when requirements or resource availability changes). Changes in the context can make SLAs obsolete, making SLA revision necessary. We propose a method to autonomously monitor the services’ context, and adapt SLAs to avoid obsolescence thereof.

== Heuristics for Enforcing Service Level Agreements ==
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.8674&rep=rep1&type=pdf Heuristics for Enforcing Service Level Agreements in a Public Computing Utility] A master's thesis by Balasubramaneyam Maniymaran.

=== Abstract ===
With the increasing popularity of consumer and research oriented wide-area applications, there arises a need for a robust and efficient wide-area resource management system. Even though there exists number of systems for wide area resource management, they fail to couple the QoS management with cost management, which is the key issue in pushing such a system to be commercially successful. Further, the lack of IT skills within the companies arouses the need of decoupling service management from the underlying complex wide-area resource management. A public computing utility (PCU) addresses both these issues, and, in addition, it creates a market place for the selling idling computing resources.

This work proposes a PCU model addressing the above mentioned issues and develops heuristics to enforce QoS in that model. A new concept called virtual clusters (VCs) is introduced as semi-dynamic, service specific resource partitions of a PCU, optimizing cost, QoS, and resource utilization. This thesis describes the methodology of VC creation, analyses the formulation of a VC creation into an optimization problem, and develops solution heuristics. The concept of VC is supported by two other concepts introduced here namely anchor point (AP) and overload partition (OLP). The concept of AP is used to represent the demand distribution in a network that assists the problem formulation of the VC creation and SLA management. The concept of overload partition is used to handle the demand spikes in a VC.

In a PCU, the VC management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the APs among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

== Service Level Agreement in Cloud Computing ==
[http://knoesis.wright.edu/library/download/OOPSLA_cloud_wsla_v3.pdf SLAs in Cloud Computing] A paper written by Pankesh Patel, Ajith Ranabahu and Amit Sheth.

=== Abstract ===
Cloud computing that provides cheap and pay-as-you-go computing resources is rapidly gaining momentum as an alternative to traditional IT Infrastructure. As more and more consumers delegate their tasks to cloud providers, Service Level Agreements (SLA) between consumers and providers emerge as a key aspect. Due to the dynamic nature of the cloud, continuous monitoring on Quality of Service (QoS) attributes is necessary to enforce SLAs. Also numerous other factors such as trust (on the cloud provider) come into consideration, particularly for enterprise customers that may outsource its critical data. This complex nature of the cloud landscape warrants a sophisticated means of managing SLAs. This paper proposes a mechanism for managing SLAs in a cloud computing environment using the Web Service Level Agreement (WSLA) framework, developed for SLA monitoring and SLA enforcement in a Service Oriented Architecture (SOA). We use the third party support feature of WSLA to delegate monitoring and enforcement tasks to other entities in order to solve the trust issues. We also present a real world use case to validate our proposal.

== Service Level Agreements on IP Networks ==

By Dinesh C. Verma, IBM T. J. Watson Research Center
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1323286&tag=1

=== Abstract ===
This paper provides an overview of service-level agreements in IP networks. It looks at the typical components of a service-level agreement, and identifies three common approaches that are used to satisfy service level agreements in IP networks. The implications of using the approaches in the context of a network service provider, a hosting service provider, and an enterprise are examined. While most providers currently offer a static insurance approach towards supporting service level agreements, the schemes that can lead to more dynamic approaches are identified.

== Trustworthiness of New Contracts ==

[http://dx.doi.org/10.1007/978-3-642-10203-5_12 Determining the Trustworthiness of New Electronic Contracts] from Lecture Notes in Computer Science by Paul Groth, Simon Miles, Sanjay Modgil, Nir Oren, Michael Luck and Yolanda Gil, 2009.

=== Abstract ===

Expressing contractual agreements electronically potentially allows agents to automatically perform functions surrounding contract use: establishment, fulfilment, renegotiation etc. For such automation to be used for real business concerns, there needs to be a high level of trust in the agent-based system. While there has been much research on simulating trust between agents, there are areas where such trust is harder to establish. In particular, contract proposals may come from parties that an agent has had no prior interaction with and, in competitive business-to-business environments, little reputation information may be available. In human practice, trust in a proposed contract is determined in part from the content of the proposal itself, and the similarity of the content to that of prior contracts, executed to varying degrees of success. In this paper, we argue that such analysis is also appropriate in automated systems, and to provide it we need systems to record salient details of prior contract use and algorithms for assessing proposals on their content. We use provenance technology to provide the former and detail algorithms for measuring contract success and similarity for the latter, applying them to an aerospace case study.

== Web Privacy with P3P ==

http://www.oreilly.de/catalog/webprivp3p/

This book discusses P3P and how companies and web developers can comply with it.
See also http://www.w3.org/P3P/

==Consumer Privacy: Balancing Economic and Justice Considerations==

M. Culnan, R. Bies, Journal of Social Issues

http://onlinelibrary.wiley.com/doi/10.1111/1540-4560.00067/full

=== Abstract ===

See the link for the abstract (it could not be copied here). This paper discusses government regulation, industry self-regulation and technological solutions with regard to the Internet.

Talk:DistOS-2011W Observability & Contracts

2011-03-10T14:19:11Z

Hadi sajjadpour:

==Observability==

* How do we define 'public' action? How do we monitor 'public' action without monitoring every action?
* How can you make sure your agent is acting according to your instructions?
* How can we ensure that information we receive through a third-party is legitimate?

== Contract Monitoring ==

[http://dx.doi.org/10.1007/978-3-642-03668-2_29 Contract Monitoring in Agent-Based Systems: Case Study] from Lecture Notes in Computer Science by Jiří Hodík, Jiří Vokřínek and Michal Jakob, 2009

=== Abstract ===

Monitoring of fulfilment of obligations defined by electronic contracts in distributed domains is presented in this paper. A two-level model of contract-based systems and the types of observations needed for contract monitoring are introduced. The observations (inter-agent communication and agents’ actions) are collected and processed by the contract observation and analysis pipeline. The presented approach has been utilized in a multi-agent system for electronic contracting in a modular certification testing domain.

== Monitoring Service Contracts ==

[http://dx.doi.org/10.1007/3-540-45705-4_15 An Agent-Based Framework for Monitoring Service Contracts] from Lecture Notes in Computer Science by Helmut Kneer, Henrik Stormer, Harald Häuschen and Burkhard Stiller, 2002

=== Abstract ===

Within the past few years, the variety of real-time multimedia streaming services on the Internet has grown steadily. Performance of streaming services is very sensitive to traffic congestion and results very often in poor service quality on today’s best effort Internet. Reasons include the lack of any traffic prioritization mechanisms on the network level and its dependence on the cooperation of several Internet Service Providers and their reliable transmission of data packets. Therefore, service differentiation and its reliable delivery must be enforced on a business level through the introduction of service contracts between service providers and their customers. However, compliance with such service contracts is the crucial point that decides about successful improvement of the service delivery process. For that reason, an agent-based monitoring framework has been developed and introduced enabling the use of mobile agents to monitor compliance with contractual agreements between service providers and service customers. This framework describes the setup and the functionality of different kinds of mobile agents that allow monitoring of service contracts across domains of multiple service providers.

==Contracts==

* What can or can't be contracted?
* How can you quantify abstract resources?
* How can two or more parties agree with a minimum of intervention?

Some contracts already exist in the form of Service Level Agreements, and efforts have been made to automate them:

== AURIC ==
[http://dx.doi.org/10.1007/978-3-540-75694-1_21 AURIC: A Scalable and Highly Reusable SLA Compliance Auditing Framework] from Lecture Notes in Computer Science, by Hasan and Burkhard Stiller, 2007.

=== Abstract ===
Service Level Agreements (SLA) are needed to allow business interactions to rely on Internet services. Service Level Objectives (SLO) specify the committed performance level of a service. Thus, SLA compliance auditing aims at verifying these commitments. Since SLOs for various application services and end-to-end performance definitions vary largely, automated auditing of SLA compliances poses the challenge to an auditing framework. Moreover, end-to-end performance data are potentially large for a provider with many customers. Therefore, this paper presents a scalable and highly reusable auditing framework and a prototype, termed AURIC (Auditing Framework for Internet Services), whose components can be distributed across different domains.

== Bandwidth ==
[http://dx.doi.org/10.1007/978-3-540-30189-9_19 SLA-Driven Flexible Bandwidth Reservation Negotiation Schemes for QoS Aware IP Networks] from Lecture Notes in Computer Science by Gerard Parr and Alan Marshall, 2004.

=== Abstract ===
We present a generic Service Level Agreement (SLA)-driven service provisioning architecture, which enables dynamic and flexible bandwidth reservation schemes on a per-user or a per-application basis. Various session level SLA negotiation schemes involving bandwidth allocation, service start time and service duration parameters are introduced and analysed. The results show that these negotiation schemes can be utilised for the benefits of both end user and network provider such as getting the highest individual SLA optimisation in terms of Quality of Service (QoS) and price. A prototype based on an industrial agent platform has also been built to demonstrate the negotiation scenario and this is presented and discussed.

== Dynamic Adaptation ==
[http://dx.doi.org/10.1007/978-3-540-89652-4_28 Context-Driven Autonomic Adaptation of SLA] from Lecture Notes in Computer Science by Caroline Herssens, Stéphane Faulkner and Ivan Jureta, 2008.

=== Abstract ===
Service Level Agreements (SLAs) are used in Service-Oriented Computing to define the obligations of the parties involved in a transaction. SLAs define the service users’ Quality of Service (QoS) requirements that the service provider should satisfy. Requirements defined once may not be satisfiable when the context of the web services changes (e.g., when requirements or resource availability changes). Changes in the context can make SLAs obsolete, making SLA revision necessary. We propose a method to autonomously monitor the services’ context, and adapt SLAs to avoid obsolescence thereof.

== Heuristics for Enforcing Service Level Agreements ==
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.8674&rep=rep1&type=pdf Heuristics for Enforcing Service Level Agreements in a Public Computing Utility] A master's thesis by Balasubramaneyam Maniymaran.

=== Abstract ===
With the increasing popularity of consumer and research oriented wide-area applications, there arises a need for a robust and efficient wide-area resource management system. Even though there exists number of systems for wide area resource management, they fail to couple the QoS management with cost management, which is the key issue in pushing such a system to be commercially successful. Further, the lack of IT skills within the companies arouses the need of decoupling service management from the underlying complex wide-area resource management. A public computing utility (PCU) addresses both these issues, and, in addition, it creates a market place for the selling idling computing resources.

This work proposes a PCU model addressing the above mentioned issues and develops heuristics to enforce QoS in that model. A new concept called virtual clusters (VCs) is introduced as semi-dynamic, service specific resource partitions of a PCU, optimizing cost, QoS, and resource utilization. This thesis describes the methodology of VC creation, analyses the formulation of a VC creation into an optimization problem, and develops solution heuristics. The concept of VC is supported by two other concepts introduced here namely anchor point (AP) and overload partition (OLP). The concept of AP is used to represent the demand distribution in a network that assists the problem formulation of the VC creation and SLA management. The concept of overload partition is used to handle the demand spikes in a VC.

In a PCU, the VC management is implemented in two phases: the first is an off-line phase of creating a VC that selects the appropriate resources and allocates them for the particular service; and the second phase employs on-line scheduling heuristic to distribute the jobs/requests from the APs among the VC nodes to achieve load balancing. A detailed simulation study is conducted to analyze the performance of different VC configurations for different load conditions and scheduling schemes and this performance is compared with a fully dynamic resource allocation scheme called Service Grid. The results verify the novelty of the VC concept.

== Service Level Agreement in Cloud Computing ==
[http://knoesis.wright.edu/library/download/OOPSLA_cloud_wsla_v3.pdf SLAs in Cloud Computing] A paper written by Pankesh Patel, Ajith Ranabahu and Amit Sheth.

=== Abstract ===
Cloud computing that provides cheap and pay-as-you-go computing resources is rapidly gaining momentum as an alternative to traditional IT Infrastructure. As more and more consumers delegate their tasks to cloud providers, Service Level Agreements (SLA) between consumers and providers emerge as a key aspect. Due to the dynamic nature of the cloud, continuous monitoring on Quality of Service (QoS) attributes is necessary to enforce SLAs. Also numerous other factors such as trust (on the cloud provider) come into consideration, particularly for enterprise customers that may outsource its critical data. This complex nature of the cloud landscape warrants a sophisticated means of managing SLAs. This paper proposes a mechanism for managing SLAs in a cloud computing environment using the Web Service Level Agreement (WSLA) framework, developed for SLA monitoring and SLA enforcement in a Service Oriented Architecture (SOA). We use the third party support feature of WSLA to delegate monitoring and enforcement tasks to other entities in order to solve the trust issues. We also present a real world use case to validate our proposal.

== Service Level Agreements on IP Networks ==

By Dinesh C. Verma, IBM T. J Watson Research center
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1323286&tag=1

=== Abstract ===
Abstract: This paper provides an overview of service-level agreements in IP networks. It looks at the typical components of a service-level agreement, and identifies three common approaches that are used to satisfy service level agreements in IP networks. The implications of using the approaches in the context of a network service provider, a hosting service provider, and an enterprise are examined. While most providers currently offer a static insurance approach towards supporting service level agreements, the schemes that can lead to more dynamic approaches are identified.

== Trustworthiness of New Contracts ==

[http://dx.doi.org/10.1007/978-3-642-10203-5_12 Determining the Trustworthiness of New Electronic Contracts] from Lecture Notes in Computer Science by Paul Groth, Simon Miles, Sanjay Modgil, Nir Oren, Michael Luck and Yolanda Gil, 2009.

=== Abstract ===

Expressing contractual agreements electronically potentially allows agents to automatically perform functions surrounding contract use: establishment, fulfilment, renegotiation etc. For such automation to be used for real business concerns, there needs to be a high level of trust in the agent-based system. While there has been much research on simulating trust between agents, there are areas where such trust is harder to establish. In particular, contract proposals may come from parties that an agent has had no prior interaction with and, in competitive business-to-business environments, little reputation information may be available. In human practice, trust in a proposed contract is determined in part from the content of the proposal itself, and the similarity of the content to that of prior contracts, executed to varying degrees of success. In this paper, we argue that such analysis is also appropriate in automated systems, and to provide it we need systems to record salient details of prior contract use and algorithms for assessing proposals on their content. We use provenance technology to provide the former and detail algorithms for measuring contract success and similarity for the latter, applying them to an aerospace case study.

== Web Privacy with P3P ==

http://www.oreilly.de/catalog/webprivp3p/

This book discusses P3P and how companies and web developers can comply with P3P.
See also http://www.w3.org/P3P/

==Consumer Privacy: Balancing Economic and Justice Considerations==

M. Culnan, R. Bies, Journal of Social Issues

http://onlinelibrary.wiley.com/doi/10.1111/1540-4560.00067/full

=== Abstract ===

The abstract could not be copied here; see the link above. This paper discusses government regulation, industry self-regulation and technological solutions with regard to the Internet.

Talk:DistOS-2011W Observability & Contracts

2011-03-10T14:06:46Z

Hadi sajjadpour:

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T02:20:50Z

Hadi sajjadpour: /* References */

Seyyed Sajjadpour

= Abstract =

Failure detection has been studied for some time. It is valuable to distributed systems because it adds to their reliability and increases their usefulness. Hence, it is important for distributed systems to detect and handle failures efficiently and accurately when they occur [3].

= Introduction =

In this literature review, I first give a brief history of failure detectors in distributed systems, then describe the best-known protocols used for monitoring between local processes, then introduce their drawbacks, and finally mention some work done to address those drawbacks.

= History =

Some of the work done in the 1990s was related to solving the consensus problem. Consensus roughly means that processors or units agree on a common decision despite failures [12].

Fischer et al. [7] prove that fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are those in which message transmission delays are unbounded [12, 13]. They conclude that models are needed that take more realistic approaches, based on assumptions about processors and communication timings. However, Chandra and Toueg [12, as cited in 4] point out that augmenting the asynchronous system model with failure detectors makes it possible to bypass the impossibility result in [7].

In [1], the author suggests that when an independent failure management system investigates a service that is not responding, it should contact the operating system running that service to gain further confidence. The independent failure detector should have the following three functional modules:

“1. A library that implements simple failure management functionality and provide the API to the complete service.

2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.

3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

In its conclusion, [1] suggests that failure detection should be a component of the operating system. Most subsequent work tries to follow this suggestion.

All failure detection algorithms use time as a means of identifying failure. The protocols differ in how timeouts are handled, when and where messages are sent and how often, and whether communication is synchronous or asynchronous. For small distributed networks, such as LANs, coordination and failure detection are simple and do not require much complexity.

= Protocols =

All failure detection mechanisms use time as a means of identifying failure. The two best-known strategies are push and pull. In [3], the authors also introduce a combination of the two.

== The Push Model ==

Assume that we have two processes, p and q, where p is the monitor. Process q sends a heartbeat message every t seconds, so process p expects an “I am alive” message from q every t seconds. Heartbeat messages are messages sent at regular intervals to report that a process is alive. If p does not receive any message from q within a timeout period T, it starts suspecting q [3,4]. The figure below appears in both [3] and [4], with a minute difference between the two.

[[File:fig1.png|400px|thumb|center|Push Model [3,4]]]

As the figure shows, process q sends heartbeat messages periodically. At some point it crashes and stops sending messages. Process p, waiting for heartbeat messages, no longer receives any and, after a timeout period T, suspects that q has failed.
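The push model can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: time is a plain number supplied by the caller rather than a real clock, and the class and method names (`PushMonitor`, `heartbeat`, `suspects`) are made up for this sketch and do not come from [3] or [4].

```python
class PushMonitor:
    """Monitor p in the push model: it only listens for heartbeats."""

    def __init__(self, timeout):
        self.timeout = timeout      # T: silence tolerated before suspecting
        self.last_seen = {}         # process name -> time of last heartbeat

    def heartbeat(self, process, now):
        """Record an "I am alive" message received from `process` at time `now`."""
        self.last_seen[process] = now

    def suspects(self, now):
        """Return the processes that have been silent for longer than T."""
        return {name for name, seen in self.last_seen.items()
                if now - seen > self.timeout}


p = PushMonitor(timeout=3)
for t in (0, 1, 2):            # q sends a heartbeat every second...
    p.heartbeat("q", t)        # ...and crashes right after time 2
print(p.suspects(now=4))       # 2 seconds of silence: q is not suspected yet
print(p.suspects(now=6))       # 4 seconds of silence: q is suspected
```

Note that only q sends messages here; p never transmits anything, which is what keeps the push model cheap in terms of message count.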

== The Pull Model ==

Assume the same parameters as above. In this model, instead of process q (the one that has to prove it is alive) sending messages to p every few seconds, p sends ‘are you alive?’ messages to q every t seconds and waits for q to respond ‘yes I am alive’ [3,4]. The figure below appears in both [3] and [4], with a minute difference between the two.

[[File:Fig2.png|400px|thumb|center|The Pull Model [3,4]]]

In the figure above, p repeatedly checks whether q is alive, and as long as q is alive it responds with the 'I am alive' heartbeat message. Once q crashes, p no longer receives responses from q and starts suspecting that q has failed after a timeout period T.
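A minimal Python sketch of the pull model follows. It makes several assumptions that are not in [3] or [4]: time is simulated with integers, replies arrive instantly, p only checks the timeout at the moments it sends a query, and the function name `first_suspicion_time` and its parameters are hypothetical.

```python
def first_suspicion_time(is_alive_at, period, timeout, end_time):
    """Simulate monitor p polling q; return when p starts suspecting q.

    is_alive_at(now) models q: it returns True if q answers the
    "are you alive?" query sent at time `now`.
    Returns None if q is never suspected before `end_time`.
    """
    last_reply = 0
    now = 0
    while now <= end_time:
        if is_alive_at(now):                 # q answers "yes I am alive"
            last_reply = now
        elif now - last_reply > timeout:     # silent for longer than T
            return now
        now += period                        # next query after t seconds
    return None


# q answers every query until it crashes at time 5.
crash_at_5 = lambda now: now < 5
print(first_suspicion_time(crash_at_5, period=1, timeout=3, end_time=20))
```

In this run the last reply arrives at time 4, so the first query at which more than T = 3 seconds of silence have elapsed is the one sent at time 8. Each successful check costs two messages (query and reply), compared with one in the push model.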

== The Dual Scheme ==

The pull model is somewhat inefficient, as potentially too many messages are sent between the processes. The push model is a bit more efficient in this respect. Hence, in [3] the authors propose a model that mixes the two. During the first message-sending phase, any process q that is being monitored by p is assumed to use the push model, and hence to send “I am alive” or “liveness” messages to p. After some time, p assumes that any such process that did not send a liveness message is using the pull model. The figure below appears only in [3].

[[File:Fig3_hadi.png|500px|thumb|center|The Dual Model[3]]]
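One monitoring round of the dual scheme might be sketched as follows in Python. The two phases follow the description above; the final step, suspecting a process that also fails to answer the pull-style query, is an assumption of this sketch, and the names `dual_round` and `answers_query` are hypothetical.

```python
def dual_round(monitored, pushed, answers_query):
    """Run one round of the dual scheme from the monitor's point of view.

    monitored     -- names of all processes that p watches
    pushed        -- names that sent a liveness message during phase one
    answers_query -- callable: does the process reply to "are you alive?"
    Returns (queried, suspected) as lists.
    """
    # Phase two: every process that stayed silent in phase one is
    # assumed to use the pull model, so p queries it explicitly.
    queried = [q for q in monitored if q not in pushed]
    # Assumption of this sketch: no reply to the query means suspicion.
    suspected = [q for q in queried if not answers_query(q)]
    return queried, suspected


queried, suspected = dual_round(
    monitored=["a", "b", "c"],
    pushed={"a"},                        # a pushes heartbeats on its own
    answers_query=lambda q: q == "b",    # b answers queries, c has crashed
)
print(queried, suspected)
```

Processes that push never receive a query, so the extra messages of the pull model are only paid for processes that need them.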

= More Complex Methods =

Although the protocols above can help us detect failures, they do not scale well to large distributed systems. This is mostly due to bandwidth limitations and message overload, so scalability is the major concern [3]. There have been various approaches to solving this problem. In this review, I touch upon the three that I investigated:

1) The Hierarchical Method

2) Gossip-style Failure Detection

3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is appropriate for a LAN. In [3], the authors introduce three kinds of elements in the hierarchical model:

1) Clients

2) Monitors

3) Monitorable objects

The figure below, from [3], illustrates the model. Note that I redrew the figure myself to avoid any copyright issues.

[[File:fig4_hadi.png|700px|thumb|center|The hierarchical model [3]]]

In the figure, FD1, FD2, FD3 and FD3’ are all monitors (failure detectors). Note that the monitors are not all in the same LAN; however, each monitor only monitors the monitorable objects in its own LAN. While monitoring these objects, the monitors also notify the clients of the status of the objects the clients need to know about. This model greatly reduces the number of messages exchanged between processes. Instead of the clients monitoring every monitorable object, the monitors do so and notify the clients when necessary or when asked. In addition, monitors ask other monitors about their monitorable objects, so they do not need to communicate with every other monitorable object. Again, this reduces the number of heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fails, the clients will not notice, as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems to be a good choice for a few LANs working together; however, it still does not solve the problem for larger-scale networks, such as WANs or work distributed over the Internet.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. Gossiping in distributed systems first appeared in [8, as cited in 6].

Gossiping dynamically exchanges information among peers. Each peer has a cache or list of other peers. Traditionally, gossiping in distributed systems has been used to disseminate information to other peers [6]. Most gossiping approaches involve the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], the authors propose a gossip-style failure detection mechanism, which we look at here.

In a gossip protocol, a member sends information to randomly chosen members [2]. The protocol in [2], like gossip protocols in general, combines the efficiency of hierarchical dissemination with the robustness of flooding protocols [2]. In effect, the protocol gossips to figure out who else is still gossiping.

Each member maintains a list of all known members, with each member's address and a heartbeat counter that serves as the basis for judging failure. Every Tg seconds, each member increments its own heartbeat counter and selects members to send its list to. A receiving member merges the arriving list with its own and adopts the maximum heartbeat counter for each member. Each member also keeps track of the last time each member's heartbeat counter increased. If a member's heartbeat counter has not increased within Tf (T fail) seconds, then that member is suspected of having failed. However, an update may take some time to reach a member, and one member may have passed Tf for another member while others have not yet done so. The authors therefore introduce another time variable, Tc, such that once this time has passed, members can have greater confidence that a suspected node has failed.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

That is the core of the algorithm. In brief, each member keeps a list of the other members with a heartbeat counter for each. Once every few seconds, it increments its own heartbeat counter and gossips its updated list to randomly chosen members. At the receiving end, a member merges the incoming list with its own. If a given member's heartbeat counter has not increased within an adequate time, that member is suspected of having failed.
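The merge rule at the heart of this scheme can be illustrated with a short Python sketch. It is not the implementation from [2]: peer selection, the network and the Tc stage are left out, time is passed in explicitly, and the names (`GossipMember`, `tick`, `snapshot`, `merge`, `suspected`) are invented for this example.

```python
class GossipMember:
    """One member's view: heartbeat counters plus local update times."""

    def __init__(self, name, t_fail):
        self.name = name
        self.t_fail = t_fail                  # Tf from the text
        # member name -> (heartbeat counter, local time it last increased)
        self.table = {name: (0, 0)}

    def tick(self, now):
        """Called every Tg seconds: increment our own heartbeat counter."""
        counter, _ = self.table[self.name]
        self.table[self.name] = (counter + 1, now)

    def snapshot(self):
        """The list that is gossiped to other members."""
        return {member: counter for member, (counter, _) in self.table.items()}

    def merge(self, incoming, now):
        """Adopt the maximum heartbeat counter for every member."""
        for member, counter in incoming.items():
            known = self.table.get(member)
            if known is None or counter > known[0]:
                self.table[member] = (counter, now)

    def suspected(self, now):
        """Members whose counter has not increased for more than Tf."""
        return {member for member, (_, seen) in self.table.items()
                if member != self.name and now - seen > self.t_fail}


a, b = GossipMember("a", t_fail=5), GossipMember("b", t_fail=5)
a.tick(now=1); b.merge(a.snapshot(), now=1)
a.tick(now=2); b.merge(a.snapshot(), now=2)
stale = a.snapshot()                 # a crashes after time 2
b.merge(stale, now=4)                # old news does not refresh a's entry
print(b.suspected(now=7), b.suspected(now=8))
```

Because only a strictly larger counter refreshes an entry, a stale list relayed by a third member cannot keep a crashed member looking alive.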

The paper then analyzes the proposed scheme, calculating failure detection time in different scenarios and varying the parameters. The authors examine how the scheme behaves against the number of failed members, the number of mistakes and the probability of message loss. They also investigate which values should be chosen for Tc and Tf with respect to the number of members, the expected failure rate and the expected message loss.

=== Expanding the Gossip (Multi-level Gossiping) ===

The scheme so far works well within a subnet. The authors extend it to a multi-level gossiping scheme. To avoid using too much bandwidth, most gossiping is done within a subnet, with fewer gossip messages sent between subnets and even fewer between domains. The protocol aims to have, on average, one member per subnet gossip to a member of another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One out of n times, where n is the size of the subnet, it picks a random subnet within its domain and a random host in that subnet to gossip to. The member then tosses another coin to decide whether to choose another domain.
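The effect of the weighted coin can be checked with a small simulation. The Python sketch below assumes the coin comes up "other subnet" with probability 1/n and ignores the second, domain-level coin; the function names are hypothetical. With n members each tossing once per round, the expected number of cross-subnet gossips per round is n × (1/n) = 1.

```python
import random


def pick_scope(subnet_size, rng):
    """Toss the weighted coin: leave the subnet one time out of n."""
    if rng.random() < 1.0 / subnet_size:
        return "other-subnet"
    return "own-subnet"


def cross_subnet_gossips_per_round(subnet_size, rounds, seed=0):
    """Average number of members per round that gossip outside the subnet."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    total = 0
    for _ in range(rounds):
        total += sum(pick_scope(subnet_size, rng) == "other-subnet"
                     for _ in range(subnet_size))
    return total / rounds


print(cross_subnet_gossips_per_round(subnet_size=20, rounds=10000))
```

The printed average should be close to 1 regardless of the subnet size, which is the property the protocol aims for.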

In [3], the authors draw a diagram that combines the hierarchical model with the gossip model of [2]. The figure is as follows:

[[File:Fig5_hadi.png|800px|thumb|center|Gossip + Hierarchical model]]

There has been other work on gossip-style failure detection [15].

== The Accrual Failure Detector ==

N. Hayashibara et al. [5] present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously with classical failure detectors. They say that a failure detection service must maintain a certain quality of service while still allowing applications to tune it to their own needs. They introduce accrual failure detectors, whose output is not Boolean; that is, it does not simply classify a process as either a) correct/working or b) suspected.

In the accrual model, a monitor outputs a value on a continuous scale rather than a Boolean. In this protocol, it is the level of confidence in a suspicion that changes. A process’s suspicion level increases over time when its heartbeats do not arrive on time. The protocol samples the heartbeats coming from different hosts, analyzes them, and uses that information to predict what pattern the next heartbeats will follow.

They use a function susp_level<sub>p</sub>(t) ≥ 0. This function grows as the suspicion level of a process p increases. It decreases if process p comes back up, and should not change much while p is working.
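A suspicion level of this kind can be sketched in Python in the spirit of the φ detector of [5]. The sketch rests on assumptions stated here rather than taken from the text above: heartbeat inter-arrival times are modelled as normally distributed, the level is computed as −log10 of the probability that a heartbeat arrives later than the time already elapsed, and the class name, method names and the small floor on the standard deviation are choices made for this example.

```python
import math
import statistics


class AccrualDetector:
    """Outputs a suspicion level on a continuous scale for one process."""

    def __init__(self):
        self.last_arrival = None
        self.intervals = []            # sampled heartbeat inter-arrival times

    def heartbeat(self, now):
        """Record a heartbeat arrival and sample the inter-arrival time."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def suspicion(self, now):
        """Return a level >= 0 that grows the longer a heartbeat is overdue."""
        if len(self.intervals) < 2:
            return 0.0                 # not enough samples to judge
        mean = statistics.fmean(self.intervals)
        # Floor on the deviation (an assumption) avoids division by zero.
        std = max(statistics.pstdev(self.intervals), 1e-3)
        elapsed = now - self.last_arrival
        # Probability that a heartbeat arrives later than `elapsed`
        # under the assumed normal distribution (upper tail).
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        if p_later == 0.0:
            return math.inf
        return max(0.0, -math.log10(p_later))


d = AccrualDetector()
for t in (0.0, 1.0, 2.1, 3.0, 4.0):    # heartbeats roughly once per second
    d.heartbeat(t)
on_time, overdue = d.suspicion(5.0), d.suspicion(6.0)
d.heartbeat(6.0)                       # the process turns out to be alive
recovered = d.suspicion(6.5)
print(on_time, overdue, recovered)
```

An application can then choose its own threshold on this value: a low threshold reacts quickly but makes more mistakes, while a high threshold is conservative.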

Further work has been done in this area, such as [10].

= Further Work =

There has also been other work in this field that I did not get a chance to analyze in depth. Such work includes the quality of service of failure detectors [9].

There has also been work on dynamic distributed systems. N. Sridhar [11] presents local failure detectors that can tolerate mobility and topology changes.

There is also work on distributed wireless networks [14].

= Conclusion =

In this literature review, I discussed some earlier work, introduced some protocols that processes use to detect failures, and then extended the discussion to wider-area networks with the more sophisticated protocols and methods that I learned about while reading different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. Van Renesse, Y. Minsky, M. Hayden, A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, T. Katayama, Failure Detectors for Large-Scale Distributed Systems, In 21st IEEE Symposium on Reliable Distributed Systems (SRDS’ 02). 2002

[5] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’ 04), pages 55-75, 2004

[6] A. Kermarrec, M. van Steen, Gossiping in Distributed Systems, ACM SIGOPS Operating Systems Review. 2007

[7] M. Fischer, N. Lynch and M. Paterson, Impossibility of Distributed Consensus with One Faulty Process, Journal of the ACM, 1985

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. “Epidemic Algorithms for Replicated Database Maintenance.” In Proc. Sixth Symp. on Principles of Distributed Computing, pp. 1-12, Aug. 1987. ACM.

[9] W. Chen, S. Toueg, M.K. Aguilera, On the Quality of Service of Failure Detectors, IEEE transactions on Computers, 2002

[10] N. Hayashibara, M. Takizawa, Design of a Notification System for the Accrual Failure Detector, 20th International Conference on Advanced Information Networking and Applications – volume 1 (AINA’ 06). 2006

[11] N. Sridhar, Decentralized Local Failure Detection in Dynamic Distributed Systems, In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems. 2006

[12] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM. 1996

[13] T. Chandra, V. Hadzilacos, S. Toueg, The Weakest Failure Detector for Solving Consensus, Journal of the ACM (JACM), 1996

[14] J. Chen, S. Kher, A. Somani, Distributed fault detection of wireless sensor networks, In proceedings of the 2006 workshop on dependability issues in wireless ad hoc networks and sensor networks DIWANS ’06, 2006

[15] S. Ranganathan, A. George, R. Todd, M. Chidester, Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters, Cluster Computing, 2001

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T02:19:31Z

Hadi sajjadpour: /* Introduction */

Seyyed Sajjadpour

= Abstract =

Failure detection has been studied for some time now. Failure detection is valuable to distributed systems as it adds to their reliability and increases their usefulness. Hence, it is important for distributed systems to be able to detect and cater to failures when they occur efficiently and accurately [3].

= Introduction =

In my literature review, I first talk a little about the history of failure detectors in distributed systems, then move on to define the most used/known protocols used to monitor in between local processes, then move on to introducing their drawbacks, and then mention some work done in solving those drawbacks.

= History =

Some work done in the 90’s was related to solving the consensus problem. Consensus roughly means processors/ units agreeing on a common decision despite failures [12].

In the paper of J. Fischer et al. [7] prove that the fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are ones that message transmission rates are unbounded [12, 13]. They conclude that there needs to be models that take more realistic approaches based on assumptions on processors and communication timings. However, Chandra and Toueg [12 from 4] mention that by augmenting the asynchronous system model with failure detectors it would be possible to bypass the impossibility in [7].

In [1], they suggest that an independent failure management system should be such that when it is investigating a service that is not responding, it will contact the Operating system that is running it to obtain further confidence. Their independent failure detector should have the following three functional modules:
“
1. A library that implements simple failure management functionality and provide the API to the complete service.
2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.
3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

In conclusion of their work, they suggest that failure detection should be a component of the operating system. Most work done after this, tries to go by this.

All failure detection algorithms/schemes use time as a means to identify failure. There are different protocols/ways of using this tool. They vary on how this timeout issue should be addressed, when/where messages should be sent and how often, synchronous or asynchronous. For small-distributed networks, such as LANs etc., coordination and failure detection is simple and does not require much complexity.

= Protocols =

All failure detection mechanisms use time as a means to identify failure. As mentioned, the two most famous ones are the push and pull strategies. In [3], they also introduce a combination of both push and pull.

== The Push Model ==

Assuming that we have two processes, p and q. With p being the process that is the monitor. Process q will be sending heartbeat messages every t seconds, hence process p will be expecting a “I am alive message” from q every t seconds. Heartbeat messages are messages that are sent on a timely bases to inform/get informed about a process. If after a timeout period T, p does not receive any messages from q, then it starts suspecting q [3,4]. The figure below is used in both [3] and [4] with a minute difference in each.

[[File:fig1.png|400px|thumb|center|Push Model [3,4]]]

As you can see in the figure, process q is sending heartbeat messages periodically. At one point it crashes and it stops sending messages. Process p, waiting for heartbeat messages, at this point does not receive any more messages, and after a timeout period of T, suspects that q has failed.

== The Pull Model ==

Assume that we have the same parameters as above. In this model, instead of the process q (the one that has to prove its alive) sending messages to p every few seconds, it sends ‘are you alive?’ messages to q every t seconds, and waits for q to respond ‘yes I am alive’ [3,4]. The figure below is used in both [3] and [4] with a minute difference in each.

[[File:Fig2.png|400px|thumb|center|The Pull Model [3,4]]]

In the above figure, p repeatedly checks if q is alive, and if q is alive it will respond with the 'I am alive' heartbeat message. Once q crashes, p does not receive any more responses from q and starts suspecting the failure of q after a timeout T time.

== The Dual Scheme ==

The pull model is somehow inefficient as there are potentially too many messages sent between the processes. The pull model is a bit more efficient in this manner. Hence, in [3] they propose a model that is a mix of the two. During the first message sending phase, any q like process that is being monitored by p, is assumed to use the push model, and hence send “ I am alive” or “liveness” messages to p. After some time, p assumes that any q like process that did not send a liveness message is using the pull model. The figure below is only in [3].

[[File:Fig3_hadi.png|500px|thumb|center|The Dual Model[3]]]

= More Complex Methods =

Although the above-mentioned protocols can help us in detecting failures in systems, however they do not scale well in large distributed systems. This is due mostly due to bandwidth limitations, message overload etc, hence the scalability problem is the major concern [3]. There have been various different approaches to solving this problem. In my review, I will touch upon the three that I investigated, which are

1) The Hierarchical Method

2) Gossip-style Failure Detection

3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is an appropriate model for a LAN. In [3], they introduce three different elements exists (by summary)
1) Clients

2) Monitors

3) Monitorable objects in the hierarchical model. I will illustrate this by the figure below from [3], note I redrew the figure myself to avoid any copyright issues.

[[File:fig4_hadi.png|700px|thumb|center|The hierarchical model [3]]]

In the figure, FD1, FD2, FD3 and FD3’ are all monitors/failure detectors. As you can notice the monitors are not all in the same LAN, however, each monitor only monitors monitorable objects in its own LAN. While monitoring the monitorables, they also notify the clients of the status of the objects they need to know about. This model greatly reduces the amount of messages exchanged in between process. Instead of the clients monitoring every object that is monitorable, monitors do so and notify the clients when necessary or when they ask for it. On top of that, monitors also ask other monitors about their monitorable objects, hence they do not need to communicate with every other monitorable object. Again this reduces the heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fail, the clients will not notice it as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems to be a good choice for a few LANs working together, however, it still does not solve our problem of larger scale networks, such as WANs or over the Internet distributed work.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. The first use of Gossiping in distributed systems first appeared in [8, from 6].

Gossiping dynamically changes information among peers. Each peer has a cache/list of other peers. Traditionally gossiping in distributed systems has been used to disseminate information/etc to other peers [6]. Most gossiping approaches have the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data, must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], they propose a Gossip-Style failure detection mechanism that we will look at here.

In gossips, a member sends information to randomly chosen members [2]. Their gossips and gossips in general mix the efficiency of hierarchical dissemination and the robustness of flooding protocols [2]. Their protocol gossips to figure out who else is still gossiping.

Each member maintains a list of each known member, its address and a heartbeat counter that will be the basis of judgment of failure. The heartbeat counter is mapped to the member each member in the list. Every Tg seconds, each member increments its own heartbeat and selects members to send its list to. Receiving members, will then merge the arrived list with their own list, and adopt the maximum heartbeat of each member in the list. Each member also keeps track of the last time a members heartbeat was incremented. If the heartbeat of a member has not increased in Tf (T fail) seconds, than that member is suspected for failure. However, given that it might take some time before a member might get an update, or one member has passed Tf of another member, but others might not necessarily have done so, they introduce another time variable Tc , such that they once they past this time, they can have greater confidence in the failure of a suspected node.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

The above mentioned is the core of the algorithm, in brief, each member has a list of other members with a heartbeat counter, once every few seconds, it randomly chooses other members to gossip its new list to and increments its own heartbeat. At the receiving end, the member merges his list with the incoming list. If for a given member, a heartbeat is not incremented with an adequate time, then that member will be suspect for failure.

The paper then goes on analyzing their proposed scheme and calculating error detection time in different scenarios and also playing around with parameters. They investigate against; number of failed members, number of mistakes and against probability of message lost. They also investigate different what values they should choose for Tc, Tf with respect to the number of members, expected failure rate, expected message lost etc.

=== Expanding the Gossip (Multi-level Gossiping) ===

Their scheme so far works well for a subnet setting. They expand their gossiping scheme to a multi level gossiping scheme. To avoid using too much bandwidth, most gossiping is done in the same subnet, with few gossiping messages done in between subnets, and even fewer between domains. Their protocol wants to have on average one member per subnet to gossip another member in another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One out of n times, where n is the size of the subnet, it picks a random subnet within its domain, and random host to gossip to. Then the member tosses another coin to choose another domain.

In [3], they draw a diagram in which they mix both the hierarchical model and the [2]’s gossip model. The figure is as follows:

[[File:Fig5_hadi.png|800px|thumb|center|Gossip + Hierarchical model]]

There have been other work done in gossip-style failure detection [15].

== The Accrual Failure Detector ==

In their paper N. Hayashibara et al. in [5] present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously while using classical failure detectors. They say that maintaining a certain level of quality-of-service on different requirements and at the same time performing failure detection must still allow tuning of services to meet their needs. They introduce the accrual failure detectors that are not based on a Boolean; either a process is a) correct/working b) suspected.

In the accrual model, a monitor will output a value on a continuous scale rather than Booleans. In this protocol, it is the level of confidence in a suspicion that changes. A process’s level of suspicion increases over time by not receiving on time heartbeats. Their protocol samples heartbeats coming from different hosts, analyzes it, and uses that info to predict what pattern the next heartbeats will follow.

They use a function, susp_level p(t) >= 0. This function will grow if the suspicion level of a process p increases. It will decrease if a process p is back up, and shouldn’t change much if it’s working.

There has been some more work done in this field such as [10].

= Further Work =

Other work has been done in this field that I did not get a chance to analyze deeply. It includes work on the quality of service of failure detectors [9].

There has also been work on dynamic distributed systems. In [11], N. Sridhar presents local failure detectors that can tolerate mobility and topology changes.

There is also work on fault detection in wireless sensor networks [14].

= Conclusion =

In this literature review, I discussed some past work, introduced protocols used among processes to detect failures, and then moved on to wider-area networks with the more sophisticated protocols and methods that I learned about while reading different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. van Renesse, Y. Minsky, M. Hayden. A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, T. Katayama. Failure Detectors for Large-Scale Distributed Systems. In 21st IEEE Symposium on Reliable Distributed Systems (SRDS ’02), 2002.

[5] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS ’04), pages 55-75, 2004.

[6] A. Kermarrec, M. van Steen. Gossiping in Distributed Systems. ACM SIGOPS Operating Systems Review, 2007.

[7] M. J. Fischer, N. Lynch, M. Paterson. Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, 1985.

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, D. Terry. Epidemic Algorithms for Replicated Database Maintenance. In Proceedings of the Sixth Symposium on Principles of Distributed Computing, pages 1-12, ACM, Aug. 1987.

[9] W. Chen, S. Toueg, M.K. Aguilera. On the Quality of Service of Failure Detectors. IEEE Transactions on Computers, 2002.

[10] N. Hayashibara, M. Takizawa. Design of a Notification System for the Accrual Failure Detector. In 20th International Conference on Advanced Information Networking and Applications, volume 1 (AINA ’06), 2006.

[11] N. Sridhar. Decentralized Local Failure Detection in Dynamic Distributed Systems. In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems, 2006.

[12] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 1996.

[13] T. Chandra, V. Hadzilacos, S. Toueg. The Weakest Failure Detector for Solving Consensus. Journal of the ACM (JACM), 1996.

[14] J. Chen, S. Kher, A. Somani. Distributed fault detection of wireless sensor networks. In Proceedings of the 2006 Workshop on Dependability Issues in Wireless Ad Hoc Networks and Sensor Networks (DIWANS ’06), 2006.

[15] S. Ranganathan, A. George, R. Todd, M. Chidester. Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters. Cluster Computing, 2001.

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T02:18:05Z

Hadi sajjadpour:

Seyyed Sajjadpour

= Abstract =

Failure detection has been studied for some time. It is valuable to distributed systems because it adds to their reliability and increases their usefulness. Hence, it is important for distributed systems to detect failures efficiently and accurately, and to handle them when they occur [3].

= Introduction =

In this literature review, I first briefly cover the history of failure detectors in distributed systems, then define the best-known protocols used for monitoring between local processes, then introduce their drawbacks, and finally mention some work done to address those drawbacks.

Most of what I learned in this task came from papers [3] and [4], which categorize the field clearly.

= History =

Some work done in the 1990s was related to solving the consensus problem. Consensus roughly means processors or units agreeing on a common decision despite failures [12].

Fischer et al. [7] prove that fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are those in which message transmission delays are unbounded [12, 13]. They conclude that models are needed that take more realistic approaches, based on assumptions about processors and communication timings. However, Chandra and Toueg ([12], as cited in [4]) mention that augmenting the asynchronous system model with failure detectors makes it possible to bypass the impossibility result of [7].

In [1], Vogels suggests that an independent failure management system, when investigating a service that is not responding, should contact the operating system running that service to obtain further confidence. The independent failure detector should have the following three functional modules:
“
1. A library that implements simple failure management functionality and provide the API to the complete service.
2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.
3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

In concluding, the author suggests that failure detection should be a component of the operating system. Most subsequent work tries to follow this suggestion.

All failure detection algorithms and schemes use time as a means to identify failure. They differ in how they use it: how the timeout should be chosen, when and where messages should be sent and how often, and whether communication is synchronous or asynchronous. For small distributed networks such as LANs, coordination and failure detection are simple and do not require much complexity.

= Protocols =

As noted above, all failure detection mechanisms use time as a means to identify failure. The two best-known strategies are push and pull. In [3], the authors also introduce a combination of both.

== The Push Model ==

Assume that we have two processes, p and q, with p being the monitor. Process q sends heartbeat messages every t seconds, so process p expects an “I am alive” message from q every t seconds. Heartbeat messages are messages sent on a regular basis to report that a process is alive. If p does not receive any message from q within a timeout period T, it starts suspecting q [3,4]. The figure below appears in both [3] and [4], with minor differences between the two.

[[File:fig1.png|400px|thumb|center|Push Model [3,4]]]

As the figure shows, process q sends heartbeat messages periodically. At some point it crashes and stops sending messages. Process p, waiting for heartbeats, no longer receives any and, after a timeout period T, suspects that q has failed.
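
The monitor side of the push model can be sketched in a few lines. This is only an illustration and not code from [3] or [4]: the class name `PushMonitor`, its methods, and the injectable `clock` are assumptions made for the example, as is the choice to treat a process that has never sent a heartbeat as suspected.

```python
import time


class PushMonitor:
    """Monitor p: suspects a process if no heartbeat arrives within timeout T."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # T in the text
        self.clock = clock          # injectable so the sketch can be tested
        self.last_heartbeat = {}    # process id -> arrival time of last heartbeat

    def on_heartbeat(self, process_id):
        """Record the arrival of an 'I am alive' message from process_id."""
        self.last_heartbeat[process_id] = self.clock()

    def is_suspected(self, process_id):
        """True once more than T seconds have passed since the last heartbeat."""
        last = self.last_heartbeat.get(process_id)
        if last is None:
            # Assumption of this sketch: a process never heard from is suspected.
            return True
        return self.clock() - last > self.timeout
```

In a real deployment, q would call the equivalent of `on_heartbeat` over the network every t seconds, and T would be chosen larger than t to tolerate message delay.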

== The Pull Model ==

Assume the same setup as above. In this model, instead of process q (the one that has to prove it is alive) sending messages to p every few seconds, p sends ‘are you alive?’ messages to q every t seconds and waits for q to respond ‘yes, I am alive’ [3,4]. The figure below appears in both [3] and [4], with minor differences between the two.

[[File:Fig2.png|400px|thumb|center|The Pull Model [3,4]]]

In the figure above, p repeatedly checks whether q is alive, and while q is alive it responds with the 'I am alive' heartbeat message. Once q crashes, p no longer receives responses from q and, after a timeout period T, starts suspecting that q has failed.

== The Dual Scheme ==

The pull model is somewhat inefficient, as potentially too many messages are sent between the processes. The push model is a bit more efficient in this respect. Hence, in [3] the authors propose a model that mixes the two. During the first phase, every process q monitored by p is assumed to use the push model and hence to send “I am alive” (liveness) messages to p. After some time, p assumes that any monitored process that did not send a liveness message is using the pull model. The figure below appears only in [3].

[[File:Fig3_hadi.png|500px|thumb|center|The Dual Model[3]]]

= More Complex Methods =

Although the protocols above can detect failures, they do not scale well in large distributed systems. This is mostly due to bandwidth limitations and message overload, so scalability is the major concern [3]. There have been various approaches to solving this problem. In this review, I touch upon the three that I investigated:

1) The Hierarchical Method

2) Gossip-style Failure Detection

3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is appropriate for a LAN. In summary, [3] introduces three kinds of elements in the hierarchical model:

1) Clients

2) Monitors

3) Monitorable objects

I illustrate this with the figure below from [3]; note that I redrew the figure myself to avoid any copyright issues.

[[File:fig4_hadi.png|700px|thumb|center|The hierarchical model [3]]]

In the figure, FD1, FD2, FD3 and FD3’ are all monitors (failure detectors). Note that the monitors are not all in the same LAN; each monitor only monitors the monitorable objects in its own LAN. While monitoring them, the monitors also notify clients of the status of the objects those clients need to know about. This model greatly reduces the number of messages exchanged between processes. Instead of every client monitoring every monitorable object, monitors do so and notify the clients when necessary or on request. In addition, monitors ask other monitors about their monitorable objects, so they do not need to communicate with every monitorable object themselves. Again, this reduces the number of heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fails, the clients will not notice, as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems a good choice for a few LANs working together; however, it still does not solve our problem for larger-scale networks, such as WANs or work distributed over the Internet.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. Gossiping first appeared in distributed systems in [8] (as cited in [6]).

Gossiping dynamically exchanges information among peers. Each peer keeps a cache (list) of other peers. Traditionally, gossiping in distributed systems has been used to disseminate information to other peers [6]. Most gossiping approaches involve the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], the authors propose a gossip-style failure detection mechanism, which we look at here.

In gossip protocols, a member sends information to randomly chosen members [2]. Gossip protocols, including theirs, combine the efficiency of hierarchical dissemination with the robustness of flooding protocols [2]. Their protocol gossips to figure out who else is still gossiping.

Each member maintains a list of the members it knows, with each member's address and a heartbeat counter that serves as the basis for judging failure. Every Tg seconds, each member increments its own heartbeat counter and selects members to send its list to. A receiving member merges the arriving list with its own and adopts the maximum heartbeat counter for each member. Each member also keeps track of the last time each member's heartbeat counter increased. If a member's heartbeat counter has not increased in Tf (T fail) seconds, that member is suspected of having failed. However, an update may take some time to reach a given member, and one member may have passed Tf for another member while others have not. The authors therefore introduce another time variable, Tc: once this time has passed, members can have greater confidence in the failure of a suspected node.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

That is the core of the algorithm. In brief, each member keeps a list of other members with a heartbeat counter for each. Once every few seconds, it increments its own heartbeat counter and randomly chooses other members to gossip its new list to. At the receiving end, the member merges the incoming list with its own. If a given member's heartbeat counter is not incremented within an adequate time, that member is suspected of having failed.

The paper then analyzes the proposed scheme, calculating detection time in different scenarios and varying the parameters. The analysis considers the number of failed members, the number of mistakes, and the probability of message loss. The authors also investigate what values to choose for Tc and Tf with respect to the number of members, the expected failure rate, the expected message loss rate, and so on.
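
The merge-and-timeout logic summarized above can be sketched as follows. This is a minimal illustration under assumptions, not the implementation from [2]: the class `GossipMember`, its method names, and the reading of Tc as a second, longer threshold after which a suspicion is treated as confirmed are all choices made for this example. Addresses and the network layer are left out.

```python
import random


class GossipMember:
    """One member of a gossip-style failure detection group (sketch)."""

    def __init__(self, name, t_fail, t_confirm):
        self.name = name
        self.t_fail = t_fail        # Tf in the text
        self.t_confirm = t_confirm  # Tc in the text (interpretation of this sketch)
        # member name -> [heartbeat counter, local time the counter last increased]
        self.table = {name: [0, 0.0]}

    def tick(self, now):
        """Every Tg seconds: increment own heartbeat, return the list to gossip."""
        entry = self.table[self.name]
        entry[0] += 1
        entry[1] = now
        return {member: hb for member, (hb, _) in self.table.items()}

    def pick_target(self, rng=random):
        """Choose a random known member to gossip to (None if alone)."""
        others = [m for m in self.table if m != self.name]
        return rng.choice(others) if others else None

    def merge(self, incoming, now):
        """Adopt the maximum heartbeat per member and note when it increased."""
        for member, hb in incoming.items():
            entry = self.table.get(member)
            if entry is None or hb > entry[0]:
                self.table[member] = [hb, now]

    def _silent_for(self, now, limit):
        return sorted(m for m, (_, seen) in self.table.items()
                      if m != self.name and now - seen > limit)

    def suspected(self, now):
        """Members whose heartbeat has not increased for more than Tf."""
        return self._silent_for(now, self.t_fail)

    def confirmed_failed(self, now):
        """Members whose heartbeat has not increased for more than Tc."""
        return self._silent_for(now, self.t_confirm)
```

A stale list, one that carries no higher heartbeat for a member, leaves that member's timestamp untouched, which is what lets the timeout eventually fire for a crashed member even though others keep gossiping about it.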

=== Expanding the Gossip (Multi-level Gossiping) ===

Their scheme so far works well within a subnet. They extend it to a multi-level gossiping scheme. To avoid using too much bandwidth, most gossiping is done within the same subnet, with few gossip messages sent between subnets and even fewer between domains. The protocol aims to have, on average, one member per subnet gossip to a member in another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One out of n times, where n is the size of the subnet, it picks a random subnet within its domain and a random host in that subnet to gossip to. The member then tosses another coin to decide whether to choose another domain.

In [3], the authors draw a diagram that combines the hierarchical model with the gossip model of [2]. The figure is as follows:
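
The two coin tosses can be illustrated with a short sketch. It is not taken from [2]: the function name and the `cross_domain_prob` parameter are hypothetical, and the weight of the second coin is left as an input because the text above does not state it.

```python
import random


def choose_gossip_level(subnet_size, cross_domain_prob, rng=random):
    """Return 'subnet', 'domain', or 'other-domain' for one gossip round."""
    # First weighted coin: leave the subnet about once in n gossips,
    # where n is the subnet size.
    if rng.random() >= 1.0 / subnet_size:
        return "subnet"
    # Second coin: cross_domain_prob is a hypothetical parameter of this
    # sketch, not a value given in the text.
    if rng.random() < cross_domain_prob:
        return "other-domain"
    return "domain"
```

With n members in a subnet each leaving the subnet with probability 1/n, the expected number of cross-subnet gossips per round is n × 1/n = 1, which matches the stated goal of one member per subnet on average.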

[[File:Fig5_hadi.png|800px|thumb|center|Gossip + Hierarchical model]]

Other work has also been done on gossip-style failure detection [15].

== The Accrual Failure Detector ==

In [5], N. Hayashibara et al. present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously while using classical failure detectors. They say that a failure detection service must maintain a certain quality of service across different requirements while still allowing applications to tune it to their needs. They introduce accrual failure detectors, which are not based on a Boolean output in which a process is either a) correct/working or b) suspected.

In the accrual model, a monitor outputs a value on a continuous scale rather than a Boolean. In this protocol, it is the level of confidence in a suspicion that changes. The suspicion level of a process increases over time as its heartbeats fail to arrive on time. The protocol samples heartbeat arrivals from each host, analyzes them, and uses that information to predict the pattern the next heartbeats will follow.

They use a function susp_level_p(t) >= 0 that gives the suspicion level of process p at time t. Its value grows while p is suspected, decreases once p is back up, and should not change much while p is working.

Some further work has been done in this field, such as [10].
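
One way to obtain such a suspicion level is sketched below. It is an illustration only: it assumes heartbeat inter-arrival times are roughly normally distributed and reports the negative base-10 logarithm of the probability that a heartbeat would still arrive later than the time already waited. The class name `AccrualDetector`, the window size, and the numeric floors are assumptions of this example, not details taken from [5].

```python
import math
from statistics import NormalDist, mean, stdev


class AccrualDetector:
    """Outputs a continuous suspicion level instead of a Boolean (sketch)."""

    def __init__(self, window=100):
        self.window = window      # number of recent inter-arrival samples kept
        self.intervals = []       # sampled heartbeat inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat arrival and sample the inter-arrival time."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
            self.intervals = self.intervals[-self.window:]
        self.last_arrival = now

    def suspicion(self, now):
        """Suspicion level >= 0; grows the longer the next heartbeat is overdue."""
        if len(self.intervals) < 2:
            return 0.0  # not enough samples to estimate a distribution
        sigma = max(stdev(self.intervals), 1e-3)   # floor avoids a zero sigma
        dist = NormalDist(mean(self.intervals), sigma)
        p_later = 1.0 - dist.cdf(now - self.last_arrival)
        return -math.log10(max(p_later, 1e-300))   # floor avoids log10(0)
```

An application would compare the returned value against its own threshold, so that different applications can trade detection speed against the risk of wrong suspicions while sharing the same monitor.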

= Further Work =

Other work has been done in this field that I did not get a chance to analyze deeply. It includes work on the quality of service of failure detectors [9].

There has also been work on dynamic distributed systems. In [11], N. Sridhar presents local failure detectors that can tolerate mobility and topology changes.

There is also work on fault detection in wireless sensor networks [14].

= Conclusion =

In this literature review, I discussed some past work, introduced protocols used among processes to detect failures, and then moved on to wider-area networks with the more sophisticated protocols and methods that I learned about while reading different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. van Renesse, Y. Minsky, M. Hayden. A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, T. Katayama. Failure Detectors for Large-Scale Distributed Systems. In 21st IEEE Symposium on Reliable Distributed Systems (SRDS ’02), 2002.

[5] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS ’04), pages 55-75, 2004.

[6] A. Kermarrec, M. van Steen. Gossiping in Distributed Systems. ACM SIGOPS Operating Systems Review, 2007.

[7] M. J. Fischer, N. Lynch, M. Paterson. Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, 1985.

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, D. Terry. Epidemic Algorithms for Replicated Database Maintenance. In Proceedings of the Sixth Symposium on Principles of Distributed Computing, pages 1-12, ACM, Aug. 1987.

[9] W. Chen, S. Toueg, M.K. Aguilera. On the Quality of Service of Failure Detectors. IEEE Transactions on Computers, 2002.

[10] N. Hayashibara, M. Takizawa. Design of a Notification System for the Accrual Failure Detector. In 20th International Conference on Advanced Information Networking and Applications, volume 1 (AINA ’06), 2006.

[11] N. Sridhar. Decentralized Local Failure Detection in Dynamic Distributed Systems. In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems, 2006.

[12] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 1996.

[13] T. Chandra, V. Hadzilacos, S. Toueg. The Weakest Failure Detector for Solving Consensus. Journal of the ACM (JACM), 1996.

[14] J. Chen, S. Kher, A. Somani. Distributed fault detection of wireless sensor networks. In Proceedings of the 2006 Workshop on Dependability Issues in Wireless Ad Hoc Networks and Sensor Networks (DIWANS ’06), 2006.

[15] S. Ranganathan, A. George, R. Todd, M. Chidester. Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters. Cluster Computing, 2001.

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T00:23:20Z

Hadi sajjadpour: /* The Pull Model */

= Abstract =

Failure detection has been studied for some time now. Failure detection is valuable to distributed systems as it adds to their reliability and increases their usefulness. Hence, it is important for distributed systems to be able to detect and cater to failures when they occur efficiently and accurately [3].

= Introduction =

In my literature review, I first talk a little about the history of failure detectors in distributed systems, then move on to define the most used/known protocols used to monitor in between local processes, then move on to introducing their drawbacks, and then mention some work done in solving those drawbacks.

Most of what I learnt in this task were from papers [3] and [4] as they have beautifully categorized everything.

= History =

Some work done in the 90’s was related to solving the consensus problem. Consensus roughly means processors/ units agreeing on a common decision despite failures [12].

In the paper of J. Fischer et al. [7] prove that the fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are ones that message transmission rates are unbounded [12, 13]. They conclude that there needs to be models that take more realistic approaches based on assumptions on processors and communication timings. However, Chandra and Toueg [12 from 4] mention that by augmenting the asynchronous system model with failure detectors it would be possible to bypass the impossibility in [7].

In [1], they suggest that an independent failure management system should be such that when it is investigating a service that is not responding, it will contact the Operating system that is running it to obtain further confidence. Their independent failure detector should have the following three functional modules:
“
1. A library that implements simple failure management functionality and provide the API to the complete service.
2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.
3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

In conclusion of their work, they suggest that failure detection should be a component of the operating system. Most work done after this, tries to go by this.

All failure detection algorithms/schemes use time as a means to identify failure. There are different protocols/ways of using this tool. They vary on how this timeout issue should be addressed, when/where messages should be sent and how often, synchronous or asynchronous. For small-distributed networks, such as LANs etc., coordination and failure detection is simple and does not require much complexity.

= Protocols =

All failure detection mechanisms use time as a means to identify failure. As mentioned, the two most famous ones are the push and pull strategies. In [3], they also introduce a combination of both push and pull.

== The Push Model ==

Assuming that we have two processes, p and q. With p being the process that is the monitor. Process q will be sending heartbeat messages every t seconds, hence process p will be expecting a “I am alive message” from q every t seconds. Heartbeat messages are messages that are sent on a timely bases to inform/get informed about a process. If after a timeout period T, p does not receive any messages from q, then it starts suspecting q [3,4]. The figure below is used in both [3] and [4] with a minute difference in each.

[[File:fig1.png|400px|thumb|center|Push Model [3,4]]]

As you can see in the figure, process q is sending heartbeat messages periodically. At one point it crashes and it stops sending messages. Process p, waiting for heartbeat messages, at this point does not receive any more messages, and after a timeout period of T, suspects that q has failed.

== The Pull Model ==

Assume that we have the same parameters as above. In this model, instead of the process q (the one that has to prove its alive) sending messages to p every few seconds, it sends ‘are you alive?’ messages to q every t seconds, and waits for q to respond ‘yes I am alive’ [3,4]. The figure below is used in both [3] and [4] with a minute difference in each.

[[File:Fig2.png|400px|thumb|center|The Pull Model [3,4]]]

In the above figure, p repeatedly checks if q is alive, and if q is alive it will respond with the 'I am alive' heartbeat message. Once q crashes, p does not receive any more responses from q and starts suspecting the failure of q after a timeout T time.

== The Dual Scheme ==

The pull model is somehow inefficient as there are potentially too many messages sent between the processes. The pull model is a bit more efficient in this manner. Hence, in [3] they propose a model that is a mix of the two. During the first message sending phase, any q like process that is being monitored by p, is assumed to use the push model, and hence send “ I am alive” or “liveness” messages to p. After some time, p assumes that any q like process that did not send a liveness message is using the pull model. The figure below is only in [3].

[[File:Fig3_hadi.png|500px|thumb|center|The Dual Model[3]]]

= More Complex Methods =

Although the above-mentioned protocols can help us in detecting failures in systems, however they do not scale well in large distributed systems. This is due mostly due to bandwidth limitations, message overload etc, hence the scalability problem is the major concern [3]. There have been various different approaches to solving this problem. In my review, I will touch upon the three that I investigated, which are

1) The Hierarchical Method

2) Gossip-style Failure Detection

3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is an appropriate model for a LAN. In [3], they introduce three different elements exists (by summary)
1) Clients

2) Monitors

3) Monitorable objects in the hierarchical model. I will illustrate this by the figure below from [3], note I redrew the figure myself to avoid any copyright issues.

[[File:fig4_hadi.png|700px|thumb|center|The hierarchical model [3]]]

In the figure, FD1, FD2, FD3 and FD3’ are all monitors/failure detectors. As you can notice the monitors are not all in the same LAN, however, each monitor only monitors monitorable objects in its own LAN. While monitoring the monitorables, they also notify the clients of the status of the objects they need to know about. This model greatly reduces the amount of messages exchanged in between process. Instead of the clients monitoring every object that is monitorable, monitors do so and notify the clients when necessary or when they ask for it. On top of that, monitors also ask other monitors about their monitorable objects, hence they do not need to communicate with every other monitorable object. Again this reduces the heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fail, the clients will not notice it as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems to be a good choice for a few LANs working together, however, it still does not solve our problem of larger scale networks, such as WANs or over the Internet distributed work.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. The first use of Gossiping in distributed systems first appeared in [8, from 6].

Gossiping dynamically changes information among peers. Each peer has a cache/list of other peers. Traditionally gossiping in distributed systems has been used to disseminate information/etc to other peers [6]. Most gossiping approaches have the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data, must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], they propose a Gossip-Style failure detection mechanism that we will look at here.

In gossips, a member sends information to randomly chosen members [2]. Their gossips and gossips in general mix the efficiency of hierarchical dissemination and the robustness of flooding protocols [2]. Their protocol gossips to figure out who else is still gossiping.

Each member maintains a list of each known member, its address and a heartbeat counter that will be the basis of judgment of failure. The heartbeat counter is mapped to the member each member in the list. Every Tg seconds, each member increments its own heartbeat and selects members to send its list to. Receiving members, will then merge the arrived list with their own list, and adopt the maximum heartbeat of each member in the list. Each member also keeps track of the last time a members heartbeat was incremented. If the heartbeat of a member has not increased in Tf (T fail) seconds, than that member is suspected for failure. However, given that it might take some time before a member might get an update, or one member has passed Tf of another member, but others might not necessarily have done so, they introduce another time variable Tc , such that they once they past this time, they can have greater confidence in the failure of a suspected node.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

The above mentioned is the core of the algorithm, in brief, each member has a list of other members with a heartbeat counter, once every few seconds, it randomly chooses other members to gossip its new list to and increments its own heartbeat. At the receiving end, the member merges his list with the incoming list. If for a given member, a heartbeat is not incremented with an adequate time, then that member will be suspect for failure.

The paper then goes on analyzing their proposed scheme and calculating error detection time in different scenarios and also playing around with parameters. They investigate against; number of failed members, number of mistakes and against probability of message lost. They also investigate different what values they should choose for Tc, Tf with respect to the number of members, expected failure rate, expected message lost etc.

=== Expanding the Gossip (Multi-level Gossiping) ===

Their scheme so far works well for a subnet setting. They expand their gossiping scheme to a multi level gossiping scheme. To avoid using too much bandwidth, most gossiping is done in the same subnet, with few gossiping messages done in between subnets, and even fewer between domains. Their protocol wants to have on average one member per subnet to gossip another member in another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One out of n times, where n is the size of the subnet, it picks a random subnet within its domain, and random host to gossip to. Then the member tosses another coin to choose another domain.

In [3], they draw a diagram in which they mix both the hierarchical model and the [2]’s gossip model. The figure is as follows:

[[File:Fig5_hadi.png|800px|thumb|center|Gossip + Hierarchical model]]

There have been other work done in gossip-style failure detection [15].

== The Accrual Failure Detector ==

In their paper N. Hayashibara et al. in [5] present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously while using classical failure detectors. They say that maintaining a certain level of quality-of-service on different requirements and at the same time performing failure detection must still allow tuning of services to meet their needs. They introduce the accrual failure detectors that are not based on a Boolean; either a process is a) correct/working b) suspected.

In the accrual model, a monitor will output a value on a continuous scale rather than Booleans. In this protocol, it is the level of confidence in a suspicion that changes. A process’s level of suspicion increases over time by not receiving on time heartbeats. Their protocol samples heartbeats coming from different hosts, analyzes it, and uses that info to predict what pattern the next heartbeats will follow.

They use a function susp_level_p(t) >= 0. Its value grows as the suspicion that process p has failed increases. It decreases if p comes back up, and should not change much while p is working.

Further work has been done in this area, such as [10].

= Further Work =

Other work has been done in this field that I did not get a chance to analyze deeply, including work on the quality of service of failure detectors [9].

Work has also been done on dynamic distributed systems. N. Sridhar [11] presents local failure detectors that can tolerate mobility and topology changes.

There is also work on distributed wireless networks [14].

= Conclusion =

In this literature review, I discussed some of the earlier work on failure detection, introduced the basic protocols that processes use to detect failures, and then moved on to wider-area networks and the more sophisticated protocols and methods that I learnt about while reading the different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. Van Renesse, Y. Minsky, M. Hayden, A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, T. Katayama, Failure Detectors for Large-Scale Distributed Systems, In 21st IEEE Symposium on Reliable Distributed Systems (SRDS’ 02). 2002

[5] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’ 04), pages 55-75, 2004

[6] A. Kermarrec, M. van Steen, Gossiping in Distributed Systems, ACM SIGOPS Operating Systems Review. 2007

[7] M. J. Fischer, N. Lynch and M. Paterson, Impossibility of Distributed Consensus with One Faulty Process, Journal of the ACM, 1985

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. “Epidemic Algorithms for Replicated Database Maintenance.” In Proc. Sixth Symp. on Principles of Distributed Computing, pp. 1-12, Aug. 1987. ACM.

[9] W. Chen, S. Toueg, M.K. Aguilera, On the Quality of Service of Failure Detectors, IEEE transactions on Computers, 2002

[10] N. Hayashibara, M. Takizawa, Design of a Notification System for the Accrual Failure Detector, 20th International Conference on Advanced Information Networking and Applications – volume 1 (AINA’ 06). 2006

[11] N. Sridhar, Decentralized Local Failure Detection in Dynamic Distributed Systems, In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems. 2006

[12] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM. 1996

[13] T. Chandra, V. Hadzilacos, S. Toueg, The Weakest Failure Detector for Solving Consensus, Journal of the ACM (JACM), 1996

[14] J. Chen, S. Kher, A. Somani, Distributed fault detection of wireless sensor networks, In proceedings of the 2006 workshop on dependability issues in wireless ad hoc networks and sensor networks DIWANS ’06, 2006

[15] S. Ranganathan, A. George, R. Todd, M. Chidester, Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters, Cluster Computing, 2001

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T00:23:04Z

Hadi sajjadpour: /* The Push Model */

= Abstract =

Failure detection has been studied for some time. It is valuable to distributed systems because it adds to their reliability and increases their usefulness. Hence, it is important for a distributed system to detect failures efficiently and accurately and to handle them when they occur [3].

= Introduction =

In this literature review, I first give a brief history of failure detectors in distributed systems. I then define the best-known protocols for monitoring between local processes, describe their drawbacks, and mention some of the work done to address those drawbacks.

Most of what I learnt for this review comes from papers [3] and [4], which categorize the material very clearly.

= History =

Some of the work done in the 1990s was related to solving the consensus problem. Consensus roughly means that processors or units agree on a common decision despite failures [12].

Fischer et al. [7] prove that fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are those in which message transmission delays are unbounded [12, 13]. They conclude that models are needed that take a more realistic approach, based on assumptions about processors and communication timings. However, Chandra and Toueg ([12], as cited in [4]) point out that augmenting the asynchronous system model with failure detectors makes it possible to bypass the impossibility result in [7].

In [1], Vogels suggests that an independent failure management system, when investigating a service that is not responding, should contact the operating system running that service to gain further confidence. The independent failure detector should have the following three functional modules:
“
1. A library that implements simple failure management functionality and provide the API to the complete service.
2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.
3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

The paper concludes that failure detection should be a component of the operating system. Most later work tries to follow this suggestion.

All failure detection algorithms use time as a means of identifying failure, and different protocols use this tool in different ways. They vary in how the timeout is chosen, when, where and how often messages are sent, and whether communication is synchronous or asynchronous. For small distributed networks such as LANs, coordination and failure detection are simple and do not require much complexity.

= Protocols =

As noted above, all failure detection mechanisms use time as a means of identifying failure. The two best-known strategies are push and pull. In [3], the authors also introduce a combination of the two.

== The Push Model ==

Assume that we have two processes, p and q, where p is the monitor. Process q sends a heartbeat message every t seconds, so process p expects an “I am alive” message from q every t seconds. Heartbeat messages are messages sent at regular intervals to report that a process is still running. If p does not receive any message from q within a timeout period T, it starts suspecting q [3,4]. The figure below appears in both [3] and [4], with minor differences between the two.

[[File:fig1.png|400px|thumb|center|Push Model [3,4]]]

As you can see in the figure, process q is sending heartbeat messages periodically. At one point it crashes and it stops sending messages. Process p, waiting for heartbeat messages, at this point does not receive any more messages, and after a timeout period of T, suspects that q has failed.
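The push model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from [3] or [4]: the class name <code>PushMonitor</code>, its methods, and the use of explicit timestamps instead of a real clock and network are all hypothetical choices made to keep the example deterministic.

```python
class PushMonitor:
    """Monitor p in the push model: q sends heartbeats, p only listens."""

    def __init__(self, timeout):
        self.timeout = timeout      # T: silence tolerated before suspecting q
        self.last_heartbeat = None  # arrival time of the latest heartbeat

    def on_heartbeat(self, now):
        # Called whenever an "I am alive" message from q arrives.
        self.last_heartbeat = now

    def suspects(self, now):
        # Before the first heartbeat there is nothing to measure against.
        if self.last_heartbeat is None:
            return False
        return now - self.last_heartbeat > self.timeout


monitor = PushMonitor(timeout=3.0)
for arrival in (0.0, 1.0, 2.0):   # q sends a heartbeat every t = 1 second
    monitor.on_heartbeat(arrival)

# q crashes after time 2.0 and sends nothing more.
print(monitor.suspects(4.0))  # 2.0 seconds of silence, still within T
print(monitor.suspects(5.5))  # 3.5 seconds of silence, longer than T
```

Note that p never sends anything in this model, so the only traffic is one message per heartbeat interval.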

== The Pull Model ==

Assume the same setup as above. In this model, instead of process q (the one that has to prove it is alive) sending messages to p every few seconds, p sends “are you alive?” messages to q every t seconds and waits for q to respond “yes, I am alive” [3,4]. The figure below appears in both [3] and [4], with minor differences between the two.

[[File:Fig2.png|400px|thumb|center|The Pull Model]]

In the above figure, p repeatedly checks whether q is alive, and if q is alive it responds with the “I am alive” heartbeat message. Once q crashes, p no longer receives responses from q and starts suspecting that q has failed after a timeout T.
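A comparable sketch for the pull model is given below. Again, this is an assumption-laden illustration rather than code from [3] or [4]: <code>MonitoredProcess</code>, <code>PullMonitor</code> and their methods are hypothetical names, replies are assumed to arrive instantly, and timestamps are passed in explicitly.

```python
class MonitoredProcess:
    """Stand-in for process q; set `alive = False` to simulate a crash."""

    def __init__(self):
        self.alive = True

    def are_you_alive(self):
        # A crashed process never answers; None models the missing reply.
        return "yes I am alive" if self.alive else None


class PullMonitor:
    """Monitor p in the pull model: p asks, q answers."""

    def __init__(self, target, timeout, start):
        self.target = target
        self.timeout = timeout    # T: time without a reply before suspecting q
        self.last_reply = start   # treat the start time as the first sign of life

    def probe(self, now):
        # Sent by p every t seconds; the reply is assumed to arrive instantly.
        if self.target.are_you_alive() is not None:
            self.last_reply = now

    def suspects(self, now):
        return now - self.last_reply > self.timeout


q = MonitoredProcess()
p = PullMonitor(q, timeout=3.0, start=0.0)
p.probe(1.0)
p.probe(2.0)
q.alive = False          # q crashes after answering the probe at time 2.0
for when in (3.0, 4.0, 5.0, 6.0):
    p.probe(when)        # these probes go unanswered
print(p.suspects(4.0), p.suspects(6.0))
```

Each check here costs a request and a reply, whereas the push model needs only the one-way heartbeat.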

== The Dual Scheme ==

The pull model is somewhat inefficient, as potentially too many messages are sent between the processes: each check costs a request and a reply. The push model is a bit more efficient in this respect. Hence, [3] proposes a model that mixes the two. During the first message-sending phase, every process q monitored by p is assumed to use the push model and hence to send “I am alive” (liveness) messages to p. After some time, p assumes that any monitored process that did not send a liveness message is using the pull model. The figure below appears only in [3].

[[File:Fig3_hadi.png|500px|thumb|center|The Dual Model[3]]]

= More Complex Methods =

Although the protocols above help detect failures, they do not scale well to large distributed systems. This is mostly due to bandwidth limitations and message overload, so scalability is the major concern [3]. There have been various approaches to solving this problem. In this review, I touch upon the three that I investigated:

1) The Hierarchical Method

2) Gossip-style Failure Detection

3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is appropriate for a LAN. In summary, [3] introduces three kinds of elements in the hierarchical model:

1) Clients

2) Monitors

3) Monitorable objects

I illustrate this with the figure below from [3]. Note that I redrew the figure myself to avoid any copyright issues.

[[File:fig4_hadi.png|700px|thumb|center|The hierarchical model [3]]]

In the figure, FD1, FD2, FD3 and FD3’ are all monitors (failure detectors). The monitors are not all in the same LAN; however, each monitor only monitors the monitorable objects in its own LAN. While monitoring these objects, the monitors also notify clients of the status of the objects they need to know about. This model greatly reduces the number of messages exchanged between processes. Instead of every client monitoring every monitorable object, the monitors do so and notify the clients when necessary or on request. In addition, monitors ask other monitors about their monitorable objects, so they do not need to communicate with every monitorable object themselves. This again reduces the number of heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fails, the clients will not notice, as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems to be a good choice for a few LANs working together. However, it still does not solve the problem for larger-scale networks, such as WANs or work distributed over the Internet.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. Gossiping was first used in distributed systems in [8] (as cited in [6]).

Gossiping dynamically exchanges information among peers. Each peer has a cache (list) of other peers. Traditionally, gossiping in distributed systems has been used to disseminate information to other peers [6]. Most gossiping approaches involve the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], they propose a Gossip-Style failure detection mechanism that we will look at here.

In gossiping, a member sends information to randomly chosen members [2]. Gossip protocols, including theirs, combine the efficiency of hierarchical dissemination with the robustness of flooding protocols [2]. Their protocol gossips to figure out who else is still gossiping.

Each member maintains a list of the members it knows about, with each member's address and a heartbeat counter that is the basis for judging failure. Every Tg seconds, each member increments its own heartbeat counter and selects members to send its list to. A receiving member merges the incoming list with its own, adopting the maximum heartbeat counter for each member. Each member also keeps track of the last time each member's heartbeat counter increased. If a member's heartbeat counter has not increased in Tf (T fail) seconds, that member is suspected of having failed. However, an update may take some time to reach a member, and one member may have passed Tf for another member while others have not. The authors therefore introduce a second time variable, Tc: once this time has passed, members can have greater confidence that a suspected node has failed.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

In brief, the core of the algorithm is as follows. Each member keeps a list of other members, each with a heartbeat counter. Once every few seconds, it increments its own heartbeat counter and gossips its updated list to randomly chosen members. The receiving member merges the incoming list with its own. If a given member's heartbeat counter is not incremented within an adequate time, that member is suspected of having failed.
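The list merge at the heart of this scheme can be sketched as follows. This is a minimal illustration, not the implementation from [2]: the class <code>GossipMember</code> and its method names are hypothetical, the two members gossip directly to each other instead of picking random peers, and the second time variable Tc is left out.

```python
class GossipMember:
    def __init__(self, name, t_fail):
        self.name = name
        self.t_fail = t_fail  # Tf: time without an increase before suspicion
        # member name -> [heartbeat counter, local time the counter last rose]
        self.table = {name: [0, 0.0]}

    def tick(self, now):
        # Every Tg seconds a member increments its own heartbeat counter.
        entry = self.table[self.name]
        entry[0] += 1
        entry[1] = now

    def merge(self, incoming, now):
        # Adopt the maximum heartbeat per member and note when it increased.
        for member, (heartbeat, _) in incoming.items():
            known = self.table.get(member)
            if known is None or heartbeat > known[0]:
                self.table[member] = [heartbeat, now]

    def suspected(self, now):
        # Members whose counter has not increased for more than Tf seconds.
        return sorted(m for m, (_, seen) in self.table.items()
                      if m != self.name and now - seen > self.t_fail)


a = GossipMember("a", t_fail=3.0)
b = GossipMember("b", t_fail=3.0)
for now in (1.0, 2.0):            # two rounds in which both members are up
    a.tick(now)
    b.tick(now)
    b.merge(a.table, now)
    a.merge(b.table, now)
for now in (3.0, 4.0, 5.0, 6.0):  # b has crashed; only a keeps ticking
    a.tick(now)
print(a.suspected(4.0), a.suspected(6.0))
```

Because only the maximum counter is adopted, a stale list arriving late can never move a member's heartbeat backwards.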

The paper then analyzes the proposed scheme, calculating detection time in different scenarios and varying the parameters. The authors examine it against the number of failed members, the number of mistakes and the probability of message loss. They also investigate what values should be chosen for Tc and Tf with respect to the number of members, the expected failure rate and the expected message loss.

=== Expanding the Gossip (Multi-level Gossiping) ===

The scheme described so far works well within a single subnet. The authors extend it to a multi-level gossiping scheme. To avoid using too much bandwidth, most gossiping is done within the same subnet, with few gossip messages sent between subnets and even fewer between domains. The protocol aims to have, on average, one member per subnet gossip to a member in another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One time out of n, where n is the size of the subnet, it picks a random subnet within its domain and a random host in that subnet to gossip to. The member then tosses another coin to decide whether to choose another domain.
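The effect of the weighted coin can be checked with a small simulation. This is a sketch under assumptions, not code from [2]: the function <code>pick_scope</code> and its return labels are hypothetical, only the subnet-level coin is modelled, and the second coin for choosing another domain is omitted.

```python
import random


def pick_scope(subnet_size, rng):
    """Return "other-subnet" with probability 1/n, otherwise "own-subnet"."""
    if rng.random() < 1.0 / subnet_size:
        return "other-subnet"
    return "own-subnet"


rng = random.Random(0)  # fixed seed so that runs are repeatable
n = 10                  # assumed subnet size
rounds = 1000

# In each round all n members toss the coin once; count the crossings.
crossings = [
    sum(pick_scope(n, rng) == "other-subnet" for _ in range(n))
    for _ in range(rounds)
]
average = sum(crossings) / rounds
print(average)  # expected to be close to 1 crossing per subnet per round
```

With n members each leaving the subnet with probability 1/n, the expected number of cross-subnet gossips per round is n × 1/n = 1, which matches the stated goal.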

In [3], the authors draw a diagram that combines the hierarchical model with the gossip model of [2]. The figure is as follows:

[[File:Fig5_hadi.png|800px|thumb|center|Gossip + Hierarchical model]]

Other work has also been done on gossip-style failure detection [15].

== The Accrual Failure Detector ==

N. Hayashibara et al. [5] present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously with classical failure detectors. A failure detection service, they say, must maintain a certain quality of service while still allowing applications with different requirements to tune it to their needs. They introduce accrual failure detectors, which are not based on a Boolean output in which a process is either a) correct/working or b) suspected.

In the accrual model, a monitor outputs a value on a continuous scale rather than a Boolean. In this protocol, it is the level of confidence in a suspicion that changes. The suspicion level of a process increases over time when its heartbeats do not arrive on time. The protocol samples the heartbeats arriving from different hosts, analyzes them, and uses that information to predict the pattern the next heartbeats will follow.

They use a function susp_level_p(t) >= 0. Its value grows as the suspicion that process p has failed increases. It decreases if p comes back up, and should not change much while p is working.
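One way such a suspicion function could be computed is sketched below. It loosely follows the idea behind the φ detector named in [5], but the details are assumptions made for illustration: heartbeat inter-arrival times are assumed to be normally distributed, the function name <code>susp_level</code> and the clamping constants are hypothetical, and at least three arrival times are needed to estimate the spread.

```python
import math
from statistics import NormalDist, mean, stdev


def susp_level(arrival_times, now):
    """Suspicion level from past heartbeat arrivals (illustrative only)."""
    # Intervals between consecutive heartbeats sampled so far.
    intervals = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    mu = mean(intervals)
    sigma = max(stdev(intervals), 1e-3)  # floor keeps the distribution valid
    # Probability that a heartbeat would still arrive later than `now`.
    elapsed = now - arrival_times[-1]
    p_later = 1.0 - NormalDist(mu, sigma).cdf(elapsed)
    # Clamp so that the logarithm stays finite after a very long silence.
    return -math.log10(max(p_later, 1e-12))


arrivals = [0.0, 1.0, 2.1, 3.0, 4.0]  # roughly one heartbeat per second
low = susp_level(arrivals, 4.5)       # next heartbeat is not yet due
mid = susp_level(arrivals, 5.2)       # heartbeat is slightly late
high = susp_level(arrivals, 6.0)      # heartbeat is very late
print(low, mid, high)
```

An application can then compare the returned value against its own threshold, which is how different applications can be more or less conservative about the same monitored process.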

Further work has been done in this area, such as [10].

= Further Work =

Other work has been done in this field that I did not get a chance to analyze deeply, including work on the quality of service of failure detectors [9].

Work has also been done on dynamic distributed systems. N. Sridhar [11] presents local failure detectors that can tolerate mobility and topology changes.

There is also work on distributed wireless networks [14].

= Conclusion =

In this literature review, I discussed some of the earlier work on failure detection, introduced the basic protocols that processes use to detect failures, and then moved on to wider-area networks and the more sophisticated protocols and methods that I learnt about while reading the different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. Van Renesse, Y. Minsky, M. Hayden, A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, T. Katayama, Failure Detectors for Large-Scale Distributed Systems, In 21st IEEE Symposium on Reliable Distributed Systems (SRDS’ 02). 2002

[5] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’ 04), pages 55-75, 2004

[6] A. Kermarrec, M. van Steen, Gossiping in Distributed Systems, ACM SIGOPS Operating Systems Review. 2007

[7] M. J. Fischer, N. Lynch and M. Paterson, Impossibility of Distributed Consensus with One Faulty Process, Journal of the ACM, 1985

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. “Epidemic Algorithms for Replicated Database Maintenance.” In Proc. Sixth Symp. on Principles of Distributed Computing, pp. 1-12, Aug. 1987. ACM.

[9] W. Chen, S. Toueg, M.K. Aguilera, On the Quality of Service of Failure Detectors, IEEE transactions on Computers, 2002

[10] N. Hayashibara, M. Takizawa, Design of a Notification System for the Accrual Failure Detector, 20th International Conference on Advanced Information Networking and Applications – volume 1 (AINA’ 06). 2006

[11] N. Sridhar, Decentralized Local Failure Detection in Dynamic Distributed Systems, In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems. 2006

[12] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM. 1996

[13] T. Chandra, V. Hadzilacos, S. Toueg, The Weakest Failure Detector for Solving Consensus, Journal of the ACM (JACM), 1996

[14] J. Chen, S. Kher, A. Somani, Distributed fault detection of wireless sensor networks, In proceedings of the 2006 workshop on dependability issues in wireless ad hoc networks and sensor networks DIWANS ’06, 2006

[15] S. Ranganathan, A. George, R. Todd, M. Chidester, Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters, Cluster Computing, 2001

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T00:17:29Z

Hadi sajjadpour: /* Gossip-Style Failure Detection */

= Abstract =

Failure detection has been studied for some time. It is valuable to distributed systems because it adds to their reliability and increases their usefulness. Hence, it is important for a distributed system to detect failures efficiently and accurately and to handle them when they occur [3].

= Introduction =

In this literature review, I first give a brief history of failure detectors in distributed systems. I then define the best-known protocols for monitoring between local processes, describe their drawbacks, and mention some of the work done to address those drawbacks.

Most of what I learnt for this review comes from papers [3] and [4], which categorize the material very clearly.

= History =

Some of the work done in the 1990s was related to solving the consensus problem. Consensus roughly means that processors or units agree on a common decision despite failures [12].

Fischer et al. [7] prove that fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are those in which message transmission delays are unbounded [12, 13]. They conclude that models are needed that take a more realistic approach, based on assumptions about processors and communication timings. However, Chandra and Toueg ([12], as cited in [4]) point out that augmenting the asynchronous system model with failure detectors makes it possible to bypass the impossibility result in [7].

In [1], Vogels suggests that an independent failure management system, when investigating a service that is not responding, should contact the operating system running that service to gain further confidence. The independent failure detector should have the following three functional modules:
“
1. A library that implements simple failure management functionality and provide the API to the complete service.
2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.
3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

The paper concludes that failure detection should be a component of the operating system. Most later work tries to follow this suggestion.

All failure detection algorithms use time as a means of identifying failure, and different protocols use this tool in different ways. They vary in how the timeout is chosen, when, where and how often messages are sent, and whether communication is synchronous or asynchronous. For small distributed networks such as LANs, coordination and failure detection are simple and do not require much complexity.

= Protocols =

As noted above, all failure detection mechanisms use time as a means of identifying failure. The two best-known strategies are push and pull. In [3], the authors also introduce a combination of the two.

== The Push Model ==

Assume that we have two processes, p and q, where p is the monitor. Process q sends a heartbeat message every t seconds, so process p expects an “I am alive” message from q every t seconds. Heartbeat messages are messages sent at regular intervals to report that a process is still running. If p does not receive any message from q within a timeout period T, it starts suspecting q [3,4]. The figure below appears in both [3] and [4], with minor differences between the two.

[[File:fig1.png|400px|thumb|center|Push Model]]

As you can see in the figure, process q is sending heartbeat messages periodically. At one point it crashes and it stops sending messages. Process p, waiting for heartbeat messages, at this point does not receive any more messages, and after a timeout period of T, suspects that q has failed.

== The Pull Model ==

Assume the same setup as above. In this model, instead of process q (the one that has to prove it is alive) sending messages to p every few seconds, p sends “are you alive?” messages to q every t seconds and waits for q to respond “yes, I am alive” [3,4]. The figure below appears in both [3] and [4], with minor differences between the two.

[[File:Fig2.png|400px|thumb|center|The Pull Model]]

In the above figure, p repeatedly checks whether q is alive, and if q is alive it responds with the “I am alive” heartbeat message. Once q crashes, p no longer receives responses from q and starts suspecting that q has failed after a timeout T.

== The Dual Scheme ==

The pull model is somewhat inefficient, as potentially too many messages are sent between the processes: each check costs a request and a reply. The push model is a bit more efficient in this respect. Hence, [3] proposes a model that mixes the two. During the first message-sending phase, every process q monitored by p is assumed to use the push model and hence to send “I am alive” (liveness) messages to p. After some time, p assumes that any monitored process that did not send a liveness message is using the pull model. The figure below appears only in [3].

[[File:Fig3_hadi.png|500px|thumb|center|The Dual Model[3]]]

= More Complex Methods =

Although the protocols above help detect failures, they do not scale well to large distributed systems. This is mostly due to bandwidth limitations and message overload, so scalability is the major concern [3]. There have been various approaches to solving this problem. In this review, I touch upon the three that I investigated:

1) The Hierarchical Method

2) Gossip-style Failure Detection

3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is appropriate for a LAN. In summary, [3] introduces three kinds of elements in the hierarchical model:

1) Clients

2) Monitors

3) Monitorable objects

I illustrate this with the figure below from [3]. Note that I redrew the figure myself to avoid any copyright issues.

[[File:fig4_hadi.png|700px|thumb|center|The hierarchical model [3]]]

In the figure, FD1, FD2, FD3 and FD3’ are all monitors (failure detectors). The monitors are not all in the same LAN; however, each monitor only monitors the monitorable objects in its own LAN. While monitoring these objects, the monitors also notify clients of the status of the objects they need to know about. This model greatly reduces the number of messages exchanged between processes. Instead of every client monitoring every monitorable object, the monitors do so and notify the clients when necessary or on request. In addition, monitors ask other monitors about their monitorable objects, so they do not need to communicate with every monitorable object themselves. This again reduces the number of heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fails, the clients will not notice, as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems to be a good choice for a few LANs working together. However, it still does not solve the problem for larger-scale networks, such as WANs or work distributed over the Internet.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. Gossiping was first used in distributed systems in [8] (as cited in [6]).

Gossiping dynamically exchanges information among peers. Each peer has a cache (list) of other peers. Traditionally, gossiping in distributed systems has been used to disseminate information to other peers [6]. Most gossiping approaches involve the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], they propose a Gossip-Style failure detection mechanism that we will look at here.

In gossiping, a member sends information to randomly chosen members [2]. Gossip protocols, including theirs, combine the efficiency of hierarchical dissemination with the robustness of flooding protocols [2]. Their protocol gossips to figure out who else is still gossiping.

Each member maintains a list of the members it knows about, with each member's address and a heartbeat counter that is the basis for judging failure. Every Tg seconds, each member increments its own heartbeat counter and selects members to send its list to. A receiving member merges the incoming list with its own, adopting the maximum heartbeat counter for each member. Each member also keeps track of the last time each member's heartbeat counter increased. If a member's heartbeat counter has not increased in Tf (T fail) seconds, that member is suspected of having failed. However, an update may take some time to reach a member, and one member may have passed Tf for another member while others have not. The authors therefore introduce a second time variable, Tc: once this time has passed, members can have greater confidence that a suspected node has failed.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

In brief, the core of the algorithm is as follows. Each member keeps a list of other members, each with a heartbeat counter. Once every few seconds, it increments its own heartbeat counter and gossips its updated list to randomly chosen members. The receiving member merges the incoming list with its own. If a given member's heartbeat counter is not incremented within an adequate time, that member is suspected of having failed.

The paper then analyzes the proposed scheme, calculating detection time in different scenarios and varying the parameters. The authors examine it against the number of failed members, the number of mistakes and the probability of message loss. They also investigate what values should be chosen for Tc and Tf with respect to the number of members, the expected failure rate and the expected message loss.

=== Expanding the Gossip (Multi-level Gossiping) ===

The scheme described so far works well within a single subnet. The authors extend it to a multi-level gossiping scheme. To avoid using too much bandwidth, most gossiping is done within the same subnet, with few gossip messages sent between subnets and even fewer between domains. The protocol aims to have, on average, one member per subnet gossip to a member in another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One time out of n, where n is the size of the subnet, it picks a random subnet within its domain and a random host in that subnet to gossip to. The member then tosses another coin to decide whether to choose another domain.

In [3], the authors draw a diagram that combines the hierarchical model with the gossip model of [2]. The figure is as follows:

[[File:Fig5_hadi.png|800px|thumb|center|Gossip + Hierarchical model]]

Other work has also been done on gossip-style failure detection [15].

== The Accrual Failure Detector ==

N. Hayashibara et al. [5] present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously with classical failure detectors. A failure detection service, they say, must maintain a certain quality of service while still allowing applications with different requirements to tune it to their needs. They introduce accrual failure detectors, which are not based on a Boolean output in which a process is either a) correct/working or b) suspected.

In the accrual model, a monitor will output a value on a continuous scale rather than Booleans. In this protocol, it is the level of confidence in a suspicion that changes. A process’s level of suspicion increases over time by not receiving on time heartbeats. Their protocol samples heartbeats coming from different hosts, analyzes it, and uses that info to predict what pattern the next heartbeats will follow.

They use a function, susp_level p(t) >= 0. This function will grow if the suspicion level of a process p increases. It will decrease if a process p is back up, and shouldn’t change much if it’s working.

There has been some more work done in this field such as [10].

= Further Work =

There has also been other work in this field that I did not get a chance to analyze deeply. Such work includes the quality of service of failure detectors [9].

There has also been work on dynamic distributed systems. N. Sridhar [11] presents local failure detectors that can tolerate mobility and topology changes.

There is also work on distributed wireless networks [14].

= Conclusion =

In this literature review, I discussed some past work, introduced some protocols that processes use to detect failures, and then moved on to wider-area networks and the more sophisticated protocols and methods that I learned about while reading different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. Van Renesse, Y. Minsky, M. Hayden, A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, T. Katayama, Failure Detectors for Large-Scale Distributed Systems, In 21st IEEE Symposium of Reliable Distributed Systems (SRDS’ 02). 2002

[5] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’ 04), pages 55-75, 2004

[6] A. Kermarrec, M. van Steen, Gossiping in Distributed Systems, ACM SIGOPS Operating Systems Review. 2007

[7] M. J. Fischer, N. Lynch and M. Paterson, Impossibility of Distributed Consensus with One Faulty Process, Journal of the ACM, 1985

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. “Epidemic Algorithms for Replicated Database Maintenance.” In Proc. Sixth Symp. on Principles of Distributed Computing, pp. 1-12, Aug. 1987. ACM.

[9] W. Chen, S. Toueg, M.K. Aguilera, On the Quality of Service of Failure Detectors, IEEE transactions on Computers, 2002

[10] N. Hayashibara, M. Takizawa, Design of a Notification System for the Accrual Failure Detector, 20th International Conference on Advanced Information Networking and Applications – volume 1 (AINA’ 06). 2006

[11] N. Sridhar, Decentralized Local Failure Detection in Dynamic Distributed Systems, In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems. 2006

[12] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM. 1996

[13] T. Chandra, V. Hadzilacos, S. Toueg, The Weakest Failure Detector for Solving Consensus, Journal of the ACM (JACM), 1996

[14] J. Chen, S. Kher, A. Somani, Distributed fault detection of wireless sensor networks, In proceedings of the 2006 workshop on dependability issues in wireless ad hoc networks and sensor networks DIWANS ’06, 2006

[15] S. Ranganathan, A. George, R. Todd, M. Chidester, Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters, Cluster Computing, 2001

File:Fig5 hadi.png

2011-03-09T00:14:36Z

Hadi sajjadpour:

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T00:13:48Z

Hadi sajjadpour: /* The Hierarchical Model */

= Abstract =

Failure detection has been studied for some time now. It is valuable to distributed systems, as it adds to their reliability and increases their usefulness. Hence, it is important for distributed systems to be able to detect failures efficiently and accurately when they occur, and to cater to them [3].

= Introduction =

In this literature review, I first give a brief history of failure detectors in distributed systems, then define the best-known protocols used for monitoring between local processes, then introduce their drawbacks, and finally mention some work done to address those drawbacks.

Most of what I learned for this review came from papers [3] and [4], as they categorize everything clearly.

= History =

Some of the work done in the 1990s was related to solving the consensus problem. Consensus roughly means processes or units agreeing on a common decision despite failures [12].

M. Fischer et al. [7] prove that fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are ones in which message transmission delays are unbounded [12, 13]. They conclude that models are needed that take more realistic approaches, based on assumptions about processor and communication timings. However, Chandra and Toueg [12, as cited in 4] point out that by augmenting the asynchronous system model with failure detectors it is possible to bypass the impossibility result in [7].

In [1], the author suggests that when an independent failure management system is investigating a service that is not responding, it should contact the operating system running that service to obtain further confidence. Such an independent failure detector should have the following three functional modules:
“
1. A library that implements simple failure management functionality and provide the API to the complete service.
2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.
3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

In its conclusion, [1] suggests that failure detection should be a component of the operating system. Most subsequent work tries to follow this suggestion.

All failure detection algorithms and schemes use time as a means of identifying failure. The protocols differ in how they use it: how the timeout should be handled, when and where messages should be sent and how often, and whether communication is synchronous or asynchronous. For small distributed networks, such as LANs, coordination and failure detection are simple and do not require much complexity.

= Protocols =

All failure detection mechanisms use time as a means of identifying failure. The two best-known strategies are push and pull. In [3], the authors also introduce a combination of push and pull.

== The Push Model ==

Assume that we have two processes, p and q, with p being the monitor. Process q sends heartbeat messages every t seconds; hence process p expects an “I am alive” message from q every t seconds. Heartbeat messages are messages sent on a regular basis to inform a monitor that a process is alive. If p does not receive any message from q within a timeout period T, it starts suspecting q [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:fig1.png|400px|thumb|center|Push Model]]

As you can see in the figure, process q is sending heartbeat messages periodically. At one point it crashes and it stops sending messages. Process p, waiting for heartbeat messages, at this point does not receive any more messages, and after a timeout period of T, suspects that q has failed.
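The push model described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the class name `PushMonitor`, its methods, and the choice not to suspect q before the first heartbeat arrives are hypothetical and do not come from [3] or [4]. Time is passed in explicitly so that the example is deterministic.

```python
class PushMonitor:
    """Monitor p in the push model: q pushes heartbeats, p only listens."""

    def __init__(self, timeout):
        self.timeout = timeout        # T: silence tolerated before suspecting q
        self.last_heartbeat = None    # arrival time of the latest heartbeat

    def on_heartbeat(self, now):
        # Called whenever an "I am alive" message from q arrives.
        self.last_heartbeat = now

    def suspects(self, now):
        # Assumption: no verdict is given before the first heartbeat arrives.
        if self.last_heartbeat is None:
            return False
        # p suspects q once more than T seconds have passed in silence.
        return now - self.last_heartbeat > self.timeout


monitor = PushMonitor(timeout=3.0)
for t in (0.0, 1.0, 2.0):        # q sends a heartbeat every second...
    monitor.on_heartbeat(t)
# ...and crashes right after t = 2.0.
print(monitor.suspects(4.0))     # only 2 s of silence so far
print(monitor.suspects(5.5))     # 3.5 s of silence exceeds T
```

Note that p never sends anything in this model, so one message per heartbeat period is enough.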

== The Pull Model ==

Assume that we have the same parameters as above. In this model, instead of process q (the one that has to prove it is alive) sending messages to p every few seconds, p sends ‘are you alive?’ messages to q every t seconds and waits for q to respond ‘yes I am alive’ [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:Fig2.png|400px|thumb|center|The Pull Model]]

In the above figure, p repeatedly checks whether q is alive, and if q is alive it responds with the 'I am alive' heartbeat message. Once q crashes, p no longer receives responses from q and starts suspecting that q has failed after a timeout period T.

== The Dual Scheme ==

The pull model is somewhat inefficient, as potentially too many messages are sent between the processes. The push model is a bit more efficient in this respect. Hence, in [3] the authors propose a model that is a mix of the two. During the first message-sending phase, any process q that is being monitored by p is assumed to use the push model, and hence to send “I am alive” (liveness) messages to p. After some time, p assumes that any process q that did not send a liveness message is using the pull model. The figure below appears only in [3].

[[File:Fig3_hadi.png|500px|thumb|center|The Dual Model[3]]]

= More Complex Methods =

Although the above-mentioned protocols can help us detect failures, they do not scale well in large distributed systems. This is mostly due to bandwidth limitations, message overload, etc.; hence scalability is the major concern [3]. There have been various approaches to solving this problem. In my review, I touch upon the three that I investigated, which are:

1) The Hierarchical Method

2) Gossip-style Failure Detection

3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is appropriate for a LAN. In [3], the authors introduce three different kinds of elements that exist in the hierarchical model (in summary):

1) Clients

2) Monitors

3) Monitorable objects

I illustrate this with the figure below from [3]. Note that I redrew the figure myself to avoid any copyright issues.

[[File:fig4_hadi.png|700px|thumb|center|The hierarchical model [3]]]

In the figure, FD1, FD2, FD3 and FD3’ are all monitors (failure detectors). Note that the monitors are not all in the same LAN; however, each monitor only monitors the monitorable objects in its own LAN. While monitoring these objects, the monitors also notify the clients of the status of the objects the clients need to know about. This model greatly reduces the number of messages exchanged between processes. Instead of the clients monitoring every monitorable object, monitors do so and notify the clients when necessary or when asked. On top of that, monitors also ask other monitors about their monitorable objects, so they do not need to communicate with every other monitorable object. Again, this reduces the number of heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fails, the clients will not notice, as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems to be a good choice for a few LANs working together; however, it still does not solve our problem for larger-scale networks, such as WANs or work distributed over the Internet.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. Gossiping in distributed systems first appeared in [8, as cited in 6].

Gossiping dynamically exchanges information among peers. Each peer has a cache/list of other peers. Traditionally, gossiping in distributed systems has been used to disseminate information to other peers [6]. Most gossiping approaches have the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], the authors propose a gossip-style failure detection mechanism that we will look at here.

In gossip protocols, a member sends information to randomly chosen members [2]. Gossip protocols, including theirs, combine the efficiency of hierarchical dissemination with the robustness of flooding protocols [2]. Their protocol gossips to figure out who else is still gossiping.

Each member maintains a list of the known members, each with its address and a heartbeat counter that serves as the basis for judging failure. Every Tg seconds, each member increments its own heartbeat counter and selects members to send its list to. A receiving member then merges the arriving list with its own, adopting the maximum heartbeat counter for each member in the list. Each member also keeps track of the last time each member's heartbeat counter increased. If the heartbeat counter of a member has not increased in Tf (T fail) seconds, then that member is suspected of having failed. However, an update may take some time to reach a given member, and one member may have passed Tf for another member while others have not yet done so. The authors therefore introduce another time variable, Tc, such that once this time has passed, members can have greater confidence in the failure of a suspected node.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

The above is the core of the algorithm. In brief, each member has a list of other members, each with a heartbeat counter. Once every few seconds, it increments its own heartbeat counter and randomly chooses other members to gossip its new list to. At the receiving end, the member merges its list with the incoming list. If the heartbeat counter of a given member is not incremented within an adequate time, that member is suspected of having failed.
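The core of the algorithm can be sketched as follows. This is a simplified illustration, not the implementation from [2]: the class `GossipMember` and its method names are hypothetical, peer selection and network transport are left out (lists are handed over directly), and the second timeout Tc is not modelled. Time is passed in explicitly to keep the example deterministic.

```python
class GossipMember:
    def __init__(self, name, t_fail):
        self.name = name
        self.t_fail = t_fail    # Tf: seconds without an increase before suspicion
        # member name -> (heartbeat counter, local time the counter last increased)
        self.table = {name: (0, 0.0)}

    def tick(self, now):
        # Every Tg seconds: increment own heartbeat and return the list to gossip.
        heartbeat, _ = self.table[self.name]
        self.table[self.name] = (heartbeat + 1, now)
        return {member: hb for member, (hb, _) in self.table.items()}

    def merge(self, incoming, now):
        # Adopt the maximum heartbeat per member; record when it last increased.
        for member, hb in incoming.items():
            known = self.table.get(member)
            if known is None or hb > known[0]:
                self.table[member] = (hb, now)

    def suspected(self, now):
        # Members whose heartbeat has not increased for more than Tf seconds.
        return sorted(member for member, (_, seen) in self.table.items()
                      if member != self.name and now - seen > self.t_fail)


a = GossipMember("a", t_fail=5.0)
b = GossipMember("b", t_fail=5.0)
a.merge(b.tick(now=1.0), now=1.0)   # b gossips its list to a
b.merge(a.tick(now=2.0), now=2.0)   # a gossips its list to b
a.tick(now=8.0)                     # b has been silent since t = 1.0
print(a.suspected(now=8.0))         # b's heartbeat is 7 s old, which exceeds Tf
```

Because only increases of the counter reset the timer, stale copies of a member's entry that are still circulating do not keep a crashed member looking alive.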

The paper then analyzes the proposed scheme, calculating the failure detection time in different scenarios and varying the parameters. The authors evaluate it against the number of failed members, the number of mistakes, and the probability of message loss. They also investigate what values should be chosen for Tc and Tf with respect to the number of members, the expected failure rate, the expected message loss rate, etc.

=== Expanding the Gossip (Multi-level Gossiping) ===

Their scheme so far works well within a single subnet. They then extend it to a multi-level gossiping scheme. To avoid using too much bandwidth, most gossiping is done within the same subnet, with few gossip messages sent between subnets, and even fewer between domains. The protocol aims to have, on average, one member per subnet gossip to a member in another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One out of n times, where n is the size of the subnet, it picks a random subnet within its domain and a random host in that subnet to gossip to. The member then tosses another coin to choose another domain.
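The effect of the weighted coin can be checked with a small simulation. The sketch below is an illustration under stated assumptions: the function names are hypothetical, only the first coin (same subnet versus another subnet) is modelled, and the subnet size of 50 is an arbitrary choice. If each of the n members leaves its subnet with probability 1/n, the expected number of cross-subnet gossips per round is n × 1/n = 1.

```python
import random


def pick_target_level(n, rng):
    """Weighted coin: gossip outside the subnet with probability 1/n."""
    return "other-subnet" if rng.random() < 1.0 / n else "same-subnet"


def average_cross_subnet_gossips(n, rounds, seed=0):
    # Average number of members per round that gossip outside their subnet.
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        total += sum(pick_target_level(n, rng) == "other-subnet"
                     for _ in range(n))
    return total / rounds


average = average_cross_subnet_gossips(n=50, rounds=2000)
print(average)   # should be close to 1 cross-subnet gossip per round
```

The average stays near one regardless of n, which is why the bandwidth used between subnets does not grow with the subnet size.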

In [3], the authors draw a diagram that combines the hierarchical model with the gossip model of [2]. The figure is as follows:

There has been other work done on gossip-style failure detection [15].

== The Accrual Failure Detector ==

N. Hayashibara et al. [5] present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously while using classical failure detectors. They say that a failure detection service should maintain a certain level of quality of service across different requirements while still allowing applications to tune it to meet their needs. They therefore introduce accrual failure detectors, which are not based on a Boolean output in which a process is either a) correct/working or b) suspected.

In the accrual model, a monitor outputs a value on a continuous scale rather than a Boolean. In this protocol, it is the level of confidence in a suspicion that changes. A process’s level of suspicion increases over time when its heartbeats are not received on time. The protocol samples the heartbeats arriving from different hosts, analyzes them, and uses that information to predict what pattern the next heartbeats will follow.

They use a function susp_level_p(t) >= 0 that represents the suspicion level of a process p at time t. Its value grows while p is increasingly suspected, decreases once p is back up, and should not change much while p is working.
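To make the behaviour of such a function concrete, here is a deliberately simplified sketch. It is not the φ computation from [5]: as an assumption for illustration, the suspicion level is taken to be how far the current silence exceeds the mean sampled heartbeat interval, measured in units of that mean. The class name `SimpleAccrualDetector` and its methods are hypothetical.

```python
from collections import deque


class SimpleAccrualDetector:
    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)   # sampled inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        # Record the interval since the previous heartbeat, then the arrival.
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def susp_level(self, now):
        # Continuous, non-negative suspicion level instead of a Boolean.
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        overdue = (now - self.last_arrival) - mean
        return max(0.0, overdue / mean)


detector = SimpleAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):       # heartbeats arrive once per second
    detector.heartbeat(t)
print(detector.susp_level(3.5))      # next heartbeat not yet due
print(detector.susp_level(5.0))      # one mean interval overdue
print(detector.susp_level(8.0))      # grows the longer p stays silent
detector.heartbeat(8.0)              # p is back up
print(detector.susp_level(8.5))      # level drops again
```

An application would compare this value against its own threshold, so different applications can trade detection speed against the risk of wrong suspicions while sharing the same monitor.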

There has been further work in this area, such as [10].

= Further Work =

There has also been other work in this field that I did not get a chance to analyze deeply. Such work includes the quality of service of failure detectors [9].

There has also been work on dynamic distributed systems. N. Sridhar [11] presents local failure detectors that can tolerate mobility and topology changes.

There is also work on distributed wireless networks [14].

= Conclusion =

In this literature review, I discussed some past work, introduced some protocols that processes use to detect failures, and then moved on to wider-area networks and the more sophisticated protocols and methods that I learned about while reading different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. Van Renesse, Y. Minsky, M. Hayden, A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, T. Katayama, Failure Detectors for Large-Scale Distributed Systems, In 21st IEEE Symposium of Reliable Distributed Systems (SRDS’ 02). 2002

[5] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’ 04), pages 55-75, 2004

[6] A. Kermarrec, M. van Steen, Gossiping in Distributed Systems, ACM SIGOPS Operating Systems Review. 2007

[7] M. J. Fischer, N. Lynch and M. Paterson, Impossibility of Distributed Consensus with One Faulty Process, Journal of the ACM, 1985

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. “Epidemic Algorithms for Replicated Database Maintenance.” In Proc. Sixth Symp. on Principles of Distributed Computing, pp. 1-12, Aug. 1987. ACM.

[9] W. Chen, S. Toueg, M.K. Aguilera, On the Quality of Service of Failure Detectors, IEEE transactions on Computers, 2002

[10] N. Hayashibara, M. Takizawa, Design of a Notification System for the Accrual Failure Detector, 20th International Conference on Advanced Information Networking and Applications – volume 1 (AINA’ 06). 2006

[11] N. Sridhar, Decentralized Local Failure Detection in Dynamic Distributed Systems, In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems. 2006

[12] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM. 1996

[13] T. Chandra, V. Hadzilacos, S. Toueg, The Weakest Failure Detector for Solving Consensus, Journal of the ACM (JACM), 1996

[14] J. Chen, S. Kher, A. Somani, Distributed fault detection of wireless sensor networks, In proceedings of the 2006 workshop on dependability issues in wireless ad hoc networks and sensor networks DIWANS ’06, 2006

[15] S. Ranganathan, A. George, R. Todd, M. Chidester, Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters, Cluster Computing, 2001

File:Fig4 hadi.png

2011-03-09T00:13:12Z

Hadi sajjadpour:

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T00:11:42Z

Hadi sajjadpour: /* The Hierarchical Model */

= Abstract =

Failure detection has been studied for some time now. It is valuable to distributed systems, as it adds to their reliability and increases their usefulness. Hence, it is important for distributed systems to be able to detect failures efficiently and accurately when they occur, and to cater to them [3].

= Introduction =

In this literature review, I first give a brief history of failure detectors in distributed systems, then define the best-known protocols used for monitoring between local processes, then introduce their drawbacks, and finally mention some work done to address those drawbacks.

Most of what I learned for this review came from papers [3] and [4], as they categorize everything clearly.

= History =

Some of the work done in the 1990s was related to solving the consensus problem. Consensus roughly means processes or units agreeing on a common decision despite failures [12].

M. Fischer et al. [7] prove that fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are ones in which message transmission delays are unbounded [12, 13]. They conclude that models are needed that take more realistic approaches, based on assumptions about processor and communication timings. However, Chandra and Toueg [12, as cited in 4] point out that by augmenting the asynchronous system model with failure detectors it is possible to bypass the impossibility result in [7].

In [1], the author suggests that when an independent failure management system is investigating a service that is not responding, it should contact the operating system running that service to obtain further confidence. Such an independent failure detector should have the following three functional modules:
“
1. A library that implements simple failure management functionality and provide the API to the complete service.
2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.
3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

In its conclusion, [1] suggests that failure detection should be a component of the operating system. Most subsequent work tries to follow this suggestion.

All failure detection algorithms and schemes use time as a means of identifying failure. The protocols differ in how they use it: how the timeout should be handled, when and where messages should be sent and how often, and whether communication is synchronous or asynchronous. For small distributed networks, such as LANs, coordination and failure detection are simple and do not require much complexity.

= Protocols =

All failure detection mechanisms use time as a means of identifying failure. The two best-known strategies are push and pull. In [3], the authors also introduce a combination of push and pull.

== The Push Model ==

Assume that we have two processes, p and q, with p being the monitor. Process q sends heartbeat messages every t seconds; hence process p expects an “I am alive” message from q every t seconds. Heartbeat messages are messages sent on a regular basis to inform a monitor that a process is alive. If p does not receive any message from q within a timeout period T, it starts suspecting q [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:fig1.png|400px|thumb|center|Push Model]]

As you can see in the figure, process q is sending heartbeat messages periodically. At one point it crashes and it stops sending messages. Process p, waiting for heartbeat messages, at this point does not receive any more messages, and after a timeout period of T, suspects that q has failed.

== The Pull Model ==

Assume that we have the same parameters as above. In this model, instead of process q (the one that has to prove it is alive) sending messages to p every few seconds, p sends ‘are you alive?’ messages to q every t seconds and waits for q to respond ‘yes I am alive’ [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:Fig2.png|400px|thumb|center|The Pull Model]]

In the above figure, p repeatedly checks whether q is alive, and if q is alive it responds with the 'I am alive' heartbeat message. Once q crashes, p no longer receives responses from q and starts suspecting that q has failed after a timeout period T.

== The Dual Scheme ==

The pull model is somewhat inefficient, as potentially too many messages are sent between the processes. The push model is a bit more efficient in this respect. Hence, in [3] the authors propose a model that is a mix of the two. During the first message-sending phase, any process q that is being monitored by p is assumed to use the push model, and hence to send “I am alive” (liveness) messages to p. After some time, p assumes that any process q that did not send a liveness message is using the pull model. The figure below appears only in [3].

[[File:Fig3_hadi.png|500px|thumb|center|The Dual Model[3]]]

= More Complex Methods =

Although the above-mentioned protocols can help us detect failures, they do not scale well in large distributed systems. This is mostly due to bandwidth limitations, message overload, etc.; hence scalability is the major concern [3]. There have been various approaches to solving this problem. In my review, I touch upon the three that I investigated, which are:

1) The Hierarchical Method

2) Gossip-style Failure Detection

3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is appropriate for a LAN. In [3], the authors introduce three different kinds of elements that exist in the hierarchical model (in summary):

1) Clients

2) Monitors

3) Monitorable objects

I illustrate this with the figure below from [3]. Note that I redrew the figure myself to avoid any copyright issues.

In the figure, FD1, FD2, FD3 and FD3’ are all monitors (failure detectors). Note that the monitors are not all in the same LAN; however, each monitor only monitors the monitorable objects in its own LAN. While monitoring these objects, the monitors also notify the clients of the status of the objects the clients need to know about. This model greatly reduces the number of messages exchanged between processes. Instead of the clients monitoring every monitorable object, monitors do so and notify the clients when necessary or when asked. On top of that, monitors also ask other monitors about their monitorable objects, so they do not need to communicate with every other monitorable object. Again, this reduces the number of heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fails, the clients will not notice, as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems to be a good choice for a few LANs working together; however, it still does not solve our problem for larger-scale networks, such as WANs or work distributed over the Internet.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. Gossiping in distributed systems first appeared in [8, as cited in 6].

Gossiping dynamically exchanges information among peers. Each peer has a cache/list of other peers. Traditionally, gossiping in distributed systems has been used to disseminate information to other peers [6]. Most gossiping approaches have the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], the authors propose a gossip-style failure detection mechanism that we will look at here.

In gossip protocols, a member sends information to randomly chosen members [2]. Gossip protocols, including theirs, combine the efficiency of hierarchical dissemination with the robustness of flooding protocols [2]. Their protocol gossips to figure out who else is still gossiping.

Each member maintains a list of the known members, each with its address and a heartbeat counter that serves as the basis for judging failure. Every Tg seconds, each member increments its own heartbeat counter and selects members to send its list to. A receiving member then merges the arriving list with its own, adopting the maximum heartbeat counter for each member in the list. Each member also keeps track of the last time each member's heartbeat counter increased. If the heartbeat counter of a member has not increased in Tf (T fail) seconds, then that member is suspected of having failed. However, an update may take some time to reach a given member, and one member may have passed Tf for another member while others have not yet done so. The authors therefore introduce another time variable, Tc, such that once this time has passed, members can have greater confidence in the failure of a suspected node.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

The above is the core of the algorithm. In brief, each member has a list of other members, each with a heartbeat counter. Once every few seconds, it increments its own heartbeat counter and randomly chooses other members to gossip its new list to. At the receiving end, the member merges its list with the incoming list. If the heartbeat counter of a given member is not incremented within an adequate time, that member is suspected of having failed.

The paper then analyzes the proposed scheme, calculating the failure detection time in different scenarios and varying the parameters. The authors evaluate it against the number of failed members, the number of mistakes, and the probability of message loss. They also investigate what values should be chosen for Tc and Tf with respect to the number of members, the expected failure rate, the expected message loss rate, etc.

=== Expanding the Gossip (Multi-level Gossiping) ===

Their scheme so far works well within a single subnet. They then extend it to a multi-level gossiping scheme. To avoid using too much bandwidth, most gossiping is done within the same subnet, with few gossip messages sent between subnets, and even fewer between domains. The protocol aims to have, on average, one member per subnet gossip to a member in another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One out of n times, where n is the size of the subnet, it picks a random subnet within its domain and a random host in that subnet to gossip to. The member then tosses another coin to choose another domain.

In [3], the authors draw a diagram that combines the hierarchical model with the gossip model of [2]. The figure is as follows:

There has been other work done on gossip-style failure detection [15].

== The Accrual Failure Detector ==

N. Hayashibara et al. [5] present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously while using classical failure detectors. They say that a failure detection service should maintain a certain level of quality of service across different requirements while still allowing applications to tune it to meet their needs. They therefore introduce accrual failure detectors, which are not based on a Boolean output in which a process is either a) correct/working or b) suspected.

In the accrual model, a monitor outputs a value on a continuous scale rather than a Boolean. In this protocol, it is the level of confidence in a suspicion that changes. A process’s level of suspicion increases over time when its heartbeats are not received on time. The protocol samples the heartbeats arriving from different hosts, analyzes them, and uses that information to predict what pattern the next heartbeats will follow.

They use a function susp_level_p(t) >= 0 that represents the suspicion level of a process p at time t. Its value grows while p is increasingly suspected, decreases once p is back up, and should not change much while p is working.

There has been further work in this area, such as [10].

= Further Work =

There has also been other work in this field that I did not get a chance to analyze deeply. Such work includes the quality of service of failure detectors [9].

There has also been work on dynamic distributed systems. N. Sridhar, in his paper [11], presents local failure detectors that can tolerate mobility and topology changes.

There is also work on distributed wireless networks [14].

= Conclusion =

In my literature review, I talked about some work that had been done in the past, introduced some protocols used among processes to detect failures, and then expanded to wider-area networks with more sophisticated protocols and methods that I learnt about while reading different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. Van Renesse, Y. Minsky, M. Hayden, A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, T. Katayama, Failure Detectors for Large-Scale Distributed Systems, In 21st IEEE Symposium on Reliable Distributed Systems (SRDS’ 02). 2002

[5] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’ 04), pages 55-75, 2004

[6] A. Kermarrec, M. van Steen, Gossiping in Distributed Systems, ACM SIGOPS Operating Systems Review. 2007

[7] M. J. Fischer, N. Lynch and M. Paterson, Impossibility of Distributed Consensus with One Faulty Process, Journal of the ACM, 1985

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. “Epidemic Algorithms for Replicated Database Maintenance.” In Proc. Sixth Symp. on Principles of Distributed Computing, pp. 1-12, Aug. 1987. ACM.

[9] W. Chen, S. Toueg, M.K. Aguilera, On the Quality of Service of Failure Detectors, IEEE transactions on Computers, 2002

[10] N. Hayashibara, M. Takizawa, Design of a Notification System for the Accrual Failure Detector, 20th International Conference on Advanced Information Networking and Applications – volume 1 (AINA’ 06). 2006

[11] N. Sridhar, Decentralized Local Failure Detection in Dynamic Distributed Systems, In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems. 2006

[12] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM. 1996

[13] T. Chandra, V. Hadzilacos, S. Toueg, The Weakest Failure Detector for Solving Consensus, Journal of the ACM (JACM), 1996

[14] J. Chen, S. Kher, A. Somani, Distributed fault detection of wireless sensor networks, In proceedings of the 2006 workshop on dependability issues in wireless ad hoc networks and sensor networks DIWANS ’06, 2006

[15] S. Ranganathan, A. George, R. Todd, M. Chidester, Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters, Cluster Computing, 2001

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T00:11:24Z

Hadi sajjadpour: /* More Complex Methods */

= Abstract =

Failure detection has been studied for some time. It is valuable to distributed systems because it adds to their reliability and increases their usefulness. Hence, it is important for distributed systems to detect failures efficiently and accurately, and to cater to them when they occur [3].

= Introduction =

In my literature review, I first talk a little about the history of failure detectors in distributed systems, then define the best-known protocols used for monitoring between local processes, then introduce their drawbacks, and finally mention some work done to address those drawbacks.

Most of what I learnt in this task came from papers [3] and [4], as they have beautifully categorized everything.

= History =

Some of the work done in the 1990s was related to solving the consensus problem. Consensus roughly means processors (or units) agreeing on a common decision despite failures [12].

In their paper, Fischer et al. [7] prove that fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are ones in which message transmission times are unbounded [12, 13]. They conclude that models are needed that take more realistic approaches, based on assumptions about processors and communication timings. However, Chandra and Toueg [12, from 4] mention that by augmenting the asynchronous system model with failure detectors, it is possible to bypass the impossibility result in [7].

In [1], Vogels suggests that an independent failure management system should, when investigating a service that is not responding, contact the operating system running that service to obtain further confidence. The independent failure detector should have the following three functional modules:
“
1. A library that implements simple failure management functionality and provide the API to the complete service.
2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.
3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

In the conclusion of that work, the author suggests that failure detection should be a component of the operating system. Most work done after this tries to follow that suggestion.

All failure detection algorithms and schemes use time as a means to identify failure. There are different protocols for using this tool: they vary in how the timeout should be handled, when and where messages should be sent, how often, and whether synchronously or asynchronously. For small distributed networks, such as LANs, coordination and failure detection are simple and do not require much complexity.

= Protocols =

All failure detection mechanisms use time as a means to identify failure. The two best-known strategies are push and pull. In [3], the authors also introduce a combination of the two.

== The Push Model ==

Assume that we have two processes, p and q, with p being the monitor. Process q sends a heartbeat message every t seconds, so process p expects an “I am alive” message from q every t seconds. Heartbeat messages are messages sent at regular intervals to report that a process is alive. If p does not receive any message from q within a timeout period T, it starts suspecting q [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:fig1.png|400px|thumb|center|Push Model]]

As you can see in the figure, process q sends heartbeat messages periodically. At some point it crashes and stops sending messages. Process p, waiting for heartbeat messages, no longer receives any, and after a timeout period of T it suspects that q has failed.
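The monitor's side of the push model can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the class name `PushMonitor`, the explicit `now` arguments (used instead of a real clock so the run is reproducible), and the choice not to suspect q before its first heartbeat are not taken from [3] or [4].

```python
class PushMonitor:
    """Monitor p in the push model: q sends heartbeats, p only listens."""

    def __init__(self, timeout):
        self.timeout = timeout        # T: silence tolerated before suspecting q
        self.last_heartbeat = None    # arrival time of the latest heartbeat

    def on_heartbeat(self, now):
        # Called whenever an "I am alive" message from q arrives.
        self.last_heartbeat = now

    def suspects(self, now):
        if self.last_heartbeat is None:
            return False  # assumption: no verdict before the first heartbeat
        return now - self.last_heartbeat > self.timeout


monitor = PushMonitor(timeout=3.0)
for t in (0.0, 1.0, 2.0):     # q sends a heartbeat every second, then crashes
    monitor.on_heartbeat(now=t)

alive_check = monitor.suspects(now=4.0)   # 2.0 s of silence, still within T
crash_check = monitor.suspects(now=5.5)   # 3.5 s of silence, exceeds T
```

Here `alive_check` is false and `crash_check` is true: p only starts suspecting q once the silence is longer than T.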

== The Pull Model ==

Assume that we have the same parameters as above. In this model, instead of process q (the one that has to prove it is alive) sending messages to p every few seconds, p sends ‘are you alive?’ messages to q every t seconds and waits for q to respond ‘yes I am alive’ [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:Fig2.png|400px|thumb|center|The Pull Model]]

In the above figure, p repeatedly checks whether q is alive, and as long as q is alive it responds with the 'I am alive' heartbeat message. Once q crashes, p no longer receives responses from q and starts suspecting that q has failed after a timeout T.
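A comparable sketch for the pull model follows. Again the names (`PullMonitor`, `send_probe`, `on_reply`) and the simulated clock are assumptions made for illustration; the point is only that p keeps track of its unanswered 'are you alive?' probes and suspects q once one of them has gone unanswered for longer than T.

```python
class PullMonitor:
    """Monitor p in the pull model: p probes q and waits for replies."""

    def __init__(self, timeout):
        self.timeout = timeout   # T: how long a probe may stay unanswered
        self.pending = {}        # probe id -> time the probe was sent
        self.next_id = 0

    def send_probe(self, now):
        probe_id = self.next_id
        self.next_id += 1
        self.pending[probe_id] = now
        return probe_id

    def on_reply(self, probe_id):
        # q answered "yes I am alive" to this probe.
        self.pending.pop(probe_id, None)

    def suspects(self, now):
        return any(now - sent > self.timeout for sent in self.pending.values())


monitor = PullMonitor(timeout=0.5)
crash_time = 2.5                 # hypothetical moment at which q crashes
verdicts = {}
for t in (0.0, 1.0, 2.0, 3.0, 4.0):
    verdicts[t] = monitor.suspects(now=t)   # check before the next probe
    probe = monitor.send_probe(now=t)
    if t < crash_time:                      # q only replies while it is alive
        monitor.on_reply(probe)
```

In this run the probe sent at time 3.0 is never answered, so `verdicts[4.0]` is the first true entry. Note that every check costs two messages (probe and reply), compared with one heartbeat in the push model.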

== The Dual Scheme ==

The pull model is somewhat inefficient, as there are potentially too many messages sent between the processes. The push model is a bit more efficient in this respect. Hence, in [3] the authors propose a model that mixes the two. During the first message-sending phase, any process q that is being monitored by p is assumed to use the push model, and hence to send “I am alive” (“liveness”) messages to p. After some time, p assumes that any such process that did not send a liveness message is using the pull model. The figure below appears only in [3].

[[File:Fig3_hadi.png|500px|thumb|center|The Dual Model[3]]]

= More Complex Methods =

Although the above-mentioned protocols can help us detect failures, they do not scale well in large distributed systems. This is mostly due to bandwidth limitations and message overload, so scalability is the major concern [3]. There have been various approaches to solving this problem. In my review, I touch upon the three that I investigated, which are:

1) The Hierarchical Method

2) Gossip-style Failure Detection

3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is appropriate for a LAN. In [3], the authors introduce three kinds of elements in the hierarchical model (in summary): 1) clients, 2) monitors, and 3) monitorable objects. I illustrate this with the figure below from [3]; note that I redrew the figure myself to avoid any copyright issues.

In the figure, FD1, FD2, FD3 and FD3’ are all monitors (failure detectors). As you can see, the monitors are not all in the same LAN; however, each monitor only monitors monitorable objects in its own LAN. While monitoring the monitorables, the monitors also notify the clients of the status of the objects they need to know about. This model greatly reduces the number of messages exchanged between processes. Instead of the clients monitoring every monitorable object, monitors do so and notify the clients when necessary or when asked. On top of that, monitors ask other monitors about their monitorable objects, so they do not need to communicate with every other monitorable object. Again, this reduces the number of heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fails, the clients will not notice, as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems to be a good choice for a few LANs working together; however, it still does not solve our problem for larger-scale networks, such as WANs or work distributed over the Internet.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. Gossiping in distributed systems first appeared in [8, from 6].

Gossiping dynamically exchanges information among peers. Each peer has a cache (list) of other peers. Traditionally, gossiping in distributed systems has been used to disseminate information to other peers [6]. Most gossiping approaches involve the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], the authors propose a gossip-style failure detection mechanism, which we look at here.

In gossip protocols, a member sends information to randomly chosen members [2]. Their protocol, like gossip protocols in general, combines the efficiency of hierarchical dissemination with the robustness of flooding protocols [2]. It gossips to figure out who else is still gossiping.

Each member maintains a list of the members it knows, with each entry holding the member's address and a heartbeat counter that is the basis for judging failure. Every Tg seconds, each member increments its own heartbeat counter and selects members to send its list to. A receiving member merges the arriving list with its own and adopts the maximum heartbeat counter for each member. Each member also keeps track of the last time each member's heartbeat counter increased. If a member's heartbeat counter has not increased in Tf (T fail) seconds, then that member is suspected of having failed. However, an update may take some time to reach a member, and one member may have passed Tf for a given member while others have not. The authors therefore introduce another time variable, Tc, such that once this time has passed, members can have greater confidence in the failure of a suspected node.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

The above is the core of the algorithm. In brief, each member has a list of other members with their heartbeat counters. Once every few seconds, it increments its own heartbeat counter and randomly chooses other members to gossip its new list to. At the receiving end, the member merges its list with the incoming list. If a given member's heartbeat counter has not increased within an adequate time, that member is suspected of having failed.
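The list merge and the Tf check can be sketched as follows. This is a simplified illustration rather than the implementation from [2]: the class and method names are my own, Tc is left out, and the gossip deliveries are scripted by hand instead of being sent to randomly chosen members, so that the outcome is reproducible.

```python
class GossipMember:
    """One member's view in a gossip-style failure detector (simplified)."""

    def __init__(self, name, t_fail):
        self.name = name
        self.t_fail = t_fail   # Tf: silence tolerated before suspecting
        # member name -> [heartbeat counter, local time the counter last rose]
        self.table = {name: [0, 0.0]}

    def tick(self, now):
        # Done once per gossip interval: increment our own heartbeat counter.
        entry = self.table[self.name]
        entry[0] += 1
        entry[1] = now

    def snapshot(self):
        # The list that is gossiped: member name -> heartbeat counter.
        return {member: hb for member, (hb, _) in self.table.items()}

    def merge(self, incoming, now):
        # Adopt the maximum heartbeat counter seen for each member.
        for member, hb in incoming.items():
            known = self.table.get(member)
            if known is None or hb > known[0]:
                self.table[member] = [hb, now]

    def suspected(self, now):
        return sorted(member for member, (_, seen) in self.table.items()
                      if member != self.name and now - seen > self.t_fail)


a, b, c = (GossipMember(name, t_fail=3.0) for name in "abc")

# Time 1: everyone is alive; c gossips to b, then b gossips to a.
for member in (a, b, c):
    member.tick(now=1.0)
b.merge(c.snapshot(), now=1.0)
a.merge(b.snapshot(), now=1.0)

# c crashes; a and b keep gossiping to each other.
for now in (2.0, 3.0, 4.0, 5.0):
    a.tick(now)
    b.tick(now)
    b.merge(a.snapshot(), now)
    a.merge(b.snapshot(), now)

suspects_of_a = a.suspected(now=5.0)
suspects_of_b = b.suspected(now=5.0)
```

Both `suspects_of_a` and `suspects_of_b` contain only `"c"`: its counter stopped rising at time 1, more than Tf = 3 seconds ago, while a and b keep refreshing each other's entries. Note that a never heard from c directly; it learnt c's counter through b.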

The paper then analyzes the proposed scheme, calculating failure detection time in different scenarios and varying the parameters. The authors investigate it against the number of failed members, the number of mistakes, and the probability of message loss. They also investigate what values to choose for Tc and Tf with respect to the number of members, the expected failure rate, the expected message loss, and so on.

=== Expanding the Gossip (Multi-level Gossiping) ===

Their scheme so far works well within a subnet. They expand it to a multi-level gossiping scheme. To avoid using too much bandwidth, most gossiping is done within the same subnet, with few gossip messages sent between subnets and even fewer between domains. The protocol aims to have, on average, one member per subnet gossip to a member in another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One out of n times, where n is the size of the subnet, it picks a random subnet within its domain and a random host in that subnet to gossip to. The member then tosses another coin to decide whether to choose another domain.
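The effect of the weighted coin can be checked with a small simulation. The sketch below covers only the first coin (leaving the subnet with probability 1/n); the weight of the second, cross-domain coin is not given here, so it is left out. The function name, the subnet size of 20 and the fixed random seed are arbitrary choices for illustration.

```python
import random


def gossips_outside_subnet(subnet_size, rng):
    """Weighted coin: True with probability 1/subnet_size."""
    return rng.random() < 1.0 / subnet_size


rng = random.Random(42)   # fixed seed so the run is repeatable
n = 20                    # hypothetical subnet size
rounds = 10_000

# In every round each of the n members tosses the coin once.
crossings = sum(
    gossips_outside_subnet(n, rng)
    for _ in range(rounds)
    for _ in range(n)
)
average_per_round = crossings / rounds
```

With n members each leaving the subnet with probability 1/n, the expected number of cross-subnet gossips per round is n × (1/n) = 1, and `average_per_round` comes out close to that value regardless of the subnet size chosen.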

In [3], the authors draw a diagram that combines the hierarchical model with the gossip model of [2]. The figure is as follows:

There has been other work done in gossip-style failure detection [15].

== The Accrual Failure Detector ==

In their paper, N. Hayashibara et al. [5] present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously using classical failure detectors. They say that a failure detection service must maintain a certain level of quality of service while still allowing applications to tune it to meet their own needs. They introduce accrual failure detectors, whose output is not a Boolean, that is, not limited to classifying a process as either a) correct/working or b) suspected.

In the accrual model, a monitor outputs a value on a continuous scale rather than a Boolean. In this protocol, it is the level of confidence in a suspicion that changes. A process’s suspicion level increases over time when its heartbeats do not arrive on time. The protocol samples the heartbeats coming from different hosts, analyzes them, and uses that information to predict what pattern the next heartbeats will follow.

They use a function susp_level p(t) >= 0 that gives the suspicion level of process p at time t. Its value grows while p is failing to send heartbeats, decreases once p is back up, and should not change much while p is working.
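One possible shape for such a suspicion function is sketched below. The formula is an assumption on my part and is not taken from the description above: the sketch treats heartbeat inter-arrival times as roughly normally distributed and reports the negative base-10 logarithm of the probability that a heartbeat would still arrive this late. The class name, the window size and the `min_std` floor are likewise hypothetical.

```python
import math
from statistics import mean, pstdev


class AccrualDetector:
    """Outputs a suspicion level on a continuous scale instead of a Boolean."""

    def __init__(self, window=100, min_std=0.1):
        self.window = window      # number of recent intervals to keep
        self.min_std = min_std    # floor, avoids dividing by a zero deviation
        self.intervals = []       # sampled heartbeat inter-arrival times
        self.last_arrival = None

    def on_heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
            self.intervals = self.intervals[-self.window:]
        self.last_arrival = now

    def suspicion(self, now):
        if not self.intervals:
            return 0.0            # assumption: no opinion without samples
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_arrival
        # Probability that the next heartbeat arrives later than `elapsed`,
        # under the assumed normal model of inter-arrival times.
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))


detector = AccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0, 4.0):       # regular heartbeats, one per second
    detector.on_heartbeat(now=t)

levels = [detector.suspicion(now=t) for t in (4.5, 5.0, 5.5)]
detector.on_heartbeat(now=6.0)            # the process is heard from again
recovered = detector.suspicion(now=6.1)
```

The values in `levels` are non-negative and increase as the silence grows, and `recovered` drops back near zero once a heartbeat arrives, which matches the behaviour described for the suspicion level. An application can then pick its own threshold on this value instead of relying on a fixed timeout.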

There has been some more work in this area, such as [10].

= Further Work =

There has also been other work in this field that I did not get a chance to analyze deeply. Such work includes the quality of service of failure detectors [9].

There has also been work on dynamic distributed systems. N. Sridhar, in his paper [11], presents local failure detectors that can tolerate mobility and topology changes.

There is also work on distributed wireless networks [14].

= Conclusion =

In my literature review, I talked about some work that had been done in the past, introduced some protocols used among processes to detect failures, and then expanded to wider-area networks with more sophisticated protocols and methods that I learnt about while reading different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. Van Renesse, Y. Minsky, M. Hayden, A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, T. Katayama, Failure Detectors for Large-Scale Distributed Systems, In 21st IEEE Symposium on Reliable Distributed Systems (SRDS’ 02). 2002

[5] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS’ 04), pages 55-75, 2004

[6] A. Kermarrec, M. van Steen, Gossiping in Distributed Systems, ACM SIGOPS Operating Systems Review. 2007

[7] M. J. Fischer, N. Lynch and M. Paterson, Impossibility of Distributed Consensus with One Faulty Process, Journal of the ACM, 1985

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. “Epidemic Algorithms for Replicated Database Maintenance.” In Proc. Sixth Symp. on Principles of Distributed Computing, pp. 1-12, Aug. 1987. ACM.

[9] W. Chen, S. Toueg, M.K. Aguilera, On the Quality of Service of Failure Detectors, IEEE transactions on Computers, 2002

[10] N. Hayashibara, M. Takizawa, Design of a Notification System for the Accrual Failure Detector, 20th International Conference on Advanced Information Networking and Applications – volume 1 (AINA’ 06). 2006

[11] N. Sridhar, Decentralized Local Failure Detection in Dynamic Distributed Systems, In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems. 2006

[12] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM. 1996

[13] T. Chandra, V. Hadzilacos, S. Toueg, The Weakest Failure Detector for Solving Consensus, Journal of the ACM (JACM), 1996

[14] J. Chen, S. Kher, A. Somani, Distributed fault detection of wireless sensor networks, In proceedings of the 2006 workshop on dependability issues in wireless ad hoc networks and sensor networks DIWANS ’06, 2006

[15] S. Ranganathan, A. George, R. Todd, M. Chidester, Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters, Cluster Computing, 2001

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T00:09:24Z

Hadi sajjadpour: /* The Dual Scheme */

= Abstract =

Failure detection has been studied for some time. It is valuable to distributed systems because it adds to their reliability and increases their usefulness. Hence, it is important for distributed systems to detect failures efficiently and accurately, and to cater to them when they occur [3].

= Introduction =

In my literature review, I first talk a little about the history of failure detectors in distributed systems, then define the best-known protocols used for monitoring between local processes, then introduce their drawbacks, and finally mention some work done to address those drawbacks.

Most of what I learnt in this task came from papers [3] and [4], as they have beautifully categorized everything.

= History =

Some of the work done in the 1990s was related to solving the consensus problem. Consensus roughly means processors (or units) agreeing on a common decision despite failures [12].

In their paper, Fischer et al. [7] prove that fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are ones in which message transmission times are unbounded [12, 13]. They conclude that models are needed that take more realistic approaches, based on assumptions about processors and communication timings. However, Chandra and Toueg [12, from 4] mention that by augmenting the asynchronous system model with failure detectors, it is possible to bypass the impossibility result in [7].

In [1], Vogels suggests that an independent failure management system should, when investigating a service that is not responding, contact the operating system running that service to obtain further confidence. The independent failure detector should have the following three functional modules:
“
1. A library that implements simple failure management functionality and provide the API to the complete service.
2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.
3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

In the conclusion of that work, the author suggests that failure detection should be a component of the operating system. Most work done after this tries to follow that suggestion.

All failure detection algorithms and schemes use time as a means to identify failure. There are different protocols for using this tool: they vary in how the timeout should be handled, when and where messages should be sent, how often, and whether synchronously or asynchronously. For small distributed networks, such as LANs, coordination and failure detection are simple and do not require much complexity.

= Protocols =

All failure detection mechanisms use time as a means to identify failure. The two best-known strategies are push and pull. In [3], the authors also introduce a combination of the two.

== The Push Model ==

Assume that we have two processes, p and q, with p being the monitor. Process q sends a heartbeat message every t seconds, so process p expects an “I am alive” message from q every t seconds. Heartbeat messages are messages sent at regular intervals to report that a process is alive. If p does not receive any message from q within a timeout period T, it starts suspecting q [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:fig1.png|400px|thumb|center|Push Model]]

As you can see in the figure, process q sends heartbeat messages periodically. At some point it crashes and stops sending messages. Process p, waiting for heartbeat messages, no longer receives any, and after a timeout period of T it suspects that q has failed.

== The Pull Model ==

Assume that we have the same parameters as above. In this model, instead of process q (the one that has to prove it is alive) sending messages to p every few seconds, p sends ‘are you alive?’ messages to q every t seconds and waits for q to respond ‘yes I am alive’ [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:Fig2.png|400px|thumb|center|The Pull Model]]

In the above figure, p repeatedly checks whether q is alive, and as long as q is alive it responds with the 'I am alive' heartbeat message. Once q crashes, p no longer receives responses from q and starts suspecting that q has failed after a timeout T.

== The Dual Scheme ==

The pull model is somewhat inefficient, as there are potentially too many messages sent between the processes. The push model is a bit more efficient in this respect. Hence, in [3] the authors propose a model that mixes the two. During the first message-sending phase, any process q that is being monitored by p is assumed to use the push model, and hence to send “I am alive” (“liveness”) messages to p. After some time, p assumes that any such process that did not send a liveness message is using the pull model. The figure below appears only in [3].

[[File:Fig3_hadi.png|500px|thumb|center|The Dual Model[3]]]

= More Complex Methods =

Although the above-mentioned protocols can help us detect failures, they do not scale well in large distributed systems. This is mostly due to bandwidth limitations and message overload, so scalability is the major concern [3]. There have been various approaches to solving this problem. In my review, I touch upon the three that I investigated, which are:

1) The Hierarchical Method
2) Gossip-style Failure Detection
3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is appropriate for a LAN. In [3], the authors introduce three kinds of elements in the hierarchical model (in summary): 1) clients, 2) monitors, and 3) monitorable objects. I illustrate this with the figure below from [3]; note that I redrew the figure myself to avoid any copyright issues.

In the figure, FD1, FD2, FD3 and FD3’ are all monitors (failure detectors). As you can see, the monitors are not all in the same LAN; however, each monitor only monitors monitorable objects in its own LAN. While monitoring the monitorables, the monitors also notify the clients of the status of the objects they need to know about. This model greatly reduces the number of messages exchanged between processes. Instead of the clients monitoring every monitorable object, monitors do so and notify the clients when necessary or when asked. On top of that, monitors ask other monitors about their monitorable objects, so they do not need to communicate with every other monitorable object. Again, this reduces the number of heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fails, the clients will not notice, as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems to be a good choice for a few LANs working together; however, it still does not solve our problem for larger-scale networks, such as WANs or work distributed over the Internet.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. Gossiping in distributed systems first appeared in [8, from 6].

Gossiping dynamically exchanges information among peers. Each peer has a cache (list) of other peers. Traditionally, gossiping in distributed systems has been used to disseminate information to other peers [6]. Most gossiping approaches involve the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], the authors propose a gossip-style failure detection mechanism, which we look at here.

In gossip protocols, a member sends information to randomly chosen members [2]. Their protocol, like gossip protocols in general, combines the efficiency of hierarchical dissemination with the robustness of flooding protocols [2]. It gossips to figure out who else is still gossiping.

Each member maintains a list of the members it knows, with each entry holding the member's address and a heartbeat counter that is the basis for judging failure. Every Tg seconds, each member increments its own heartbeat counter and selects members to send its list to. A receiving member merges the arriving list with its own and adopts the maximum heartbeat counter for each member. Each member also keeps track of the last time each member's heartbeat counter increased. If a member's heartbeat counter has not increased in Tf (T fail) seconds, then that member is suspected of having failed. However, an update may take some time to reach a member, and one member may have passed Tf for a given member while others have not. The authors therefore introduce another time variable, Tc, such that once this time has passed, members can have greater confidence in the failure of a suspected node.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

The above is the core of the algorithm. In brief, each member has a list of other members with their heartbeat counters. Once every few seconds, it increments its own heartbeat counter and randomly chooses other members to gossip its new list to. At the receiving end, the member merges its list with the incoming list. If a given member's heartbeat counter has not increased within an adequate time, that member is suspected of having failed.

The paper then analyzes the proposed scheme, calculating failure detection time in different scenarios and varying the parameters. The authors investigate it against the number of failed members, the number of mistakes, and the probability of message loss. They also investigate what values to choose for Tc and Tf with respect to the number of members, the expected failure rate, the expected message loss, and so on.

=== Expanding the Gossip (Multi-level Gossiping) ===

Their scheme so far works well within a subnet. They expand it to a multi-level gossiping scheme. To avoid using too much bandwidth, most gossiping is done within the same subnet, with few gossip messages sent between subnets and even fewer between domains. The protocol aims to have, on average, one member per subnet gossip to a member in another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One out of n times, where n is the size of the subnet, it picks a random subnet within its domain and a random host in that subnet to gossip to. The member then tosses another coin to decide whether to choose another domain.

In [3], the authors draw a diagram that combines the hierarchical model with the gossip model of [2]. The figure is as follows:

There has been other work done in gossip-style failure detection [15].

== The Accrual Failure Detector ==

In their paper, N. Hayashibara et al. [5] present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously using classical failure detectors. They say that a failure detection service must maintain a certain level of quality of service while still allowing applications to tune it to meet their own needs. They introduce accrual failure detectors, whose output is not a Boolean, that is, not limited to classifying a process as either a) correct/working or b) suspected.

In the accrual model, a monitor outputs a value on a continuous scale rather than a Boolean. In this protocol, it is the level of confidence in a suspicion that changes. A process’s suspicion level increases over time when its heartbeats do not arrive on time. The protocol samples the heartbeats coming from different hosts, analyzes them, and uses that information to predict what pattern the next heartbeats will follow.

They use a function susp_level p(t) >= 0 that gives the suspicion level of process p at time t. Its value grows while p is failing to send heartbeats, decreases once p is back up, and should not change much while p is working.

There has been some more work in this area, such as [10].

= Further Work =

Other work has been done in this field that I did not get a chance to analyze deeply. Such work includes the quality of service of failure detectors [9].

There has also been work on dynamic distributed systems. In [11], N. Sridhar presents local failure detectors that can tolerate mobility and topology changes.

There is also work on distributed wireless networks [14].

= Conclusion =

In this literature review, I discussed some past work, introduced some protocols that processes use to detect failures, and then extended the discussion to wide area networks with the more sophisticated protocols and methods that I learnt about while reading different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. Van Renesse, Y. Minsky, and M. Hayden. A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, and T. Katayama. Failure detectors for large-scale distributed systems. In 21st IEEE Symposium on Reliable Distributed Systems (SRDS ’02), 2002.

[5] N. Hayashibara, X. Defago, R. Yared, and T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS ’04), pages 55-75, 2004.

[6] A. Kermarrec and M. van Steen. Gossiping in distributed systems. ACM SIGOPS Operating Systems Review, 2007.

[7] M. Fischer, N. Lynch, and M. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 1985.

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In Proceedings of the Sixth Symposium on Principles of Distributed Computing, pages 1-12, August 1987. ACM.

[9] W. Chen, S. Toueg, and M. K. Aguilera. On the quality of service of failure detectors. IEEE Transactions on Computers, 2002.

[10] N. Hayashibara and M. Takizawa. Design of a notification system for the accrual failure detector. In 20th International Conference on Advanced Information Networking and Applications, Volume 1 (AINA ’06), 2006.

[11] N. Sridhar. Decentralized local failure detection in dynamic distributed systems. In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems, 2006.

[12] T. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 1996.

[13] T. Chandra, V. Hadzilacos, and S. Toueg. The weakest failure detector for solving consensus. Journal of the ACM (JACM), 1996.

[14] J. Chen, S. Kher, and A. Somani. Distributed fault detection of wireless sensor networks. In Proceedings of the 2006 Workshop on Dependability Issues in Wireless Ad Hoc Networks and Sensor Networks (DIWANS ’06), 2006.

[15] S. Ranganathan, A. George, R. Todd, and M. Chidester. Gossip-style failure detection and distributed consensus for scalable heterogeneous clusters. Cluster Computing, 2001.

File:Fig3 hadi.png

2011-03-09T00:08:03Z

Hadi sajjadpour:

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T00:07:08Z

Hadi sajjadpour: /* References */

= Abstract =

Failure detection has been studied for some time now. It is valuable to distributed systems because it adds to their reliability and increases their usefulness. Hence, it is important for distributed systems to be able to detect failures efficiently and accurately when they occur, and to cater to them [3].

= Introduction =

In this literature review, I first briefly cover the history of failure detectors in distributed systems, then define the best-known protocols used for monitoring between local processes, then introduce their drawbacks, and finally mention some work done to address those drawbacks.

Most of what I learnt in this task came from papers [3] and [4], as they categorize everything beautifully.

= History =

Some work done in the 90’s was related to solving the consensus problem. Consensus roughly means processors or units agreeing on a common decision despite failures [12].

In [7], M. Fischer et al. prove that fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are ones in which message transmission delays are unbounded [12, 13]. They conclude that models are needed that take more realistic approaches, based on assumptions about processors and communication timings. However, Chandra and Toueg [12, cited in 4] mention that augmenting the asynchronous system model with failure detectors makes it possible to bypass the impossibility result in [7].

In [1], the author suggests that when an independent failure management system investigates a service that is not responding, it should contact the operating system running that service to obtain further confidence. The independent failure detector should have the following three functional modules:
“
1. A library that implements simple failure management functionality and provide the API to the complete service.
2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.
3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

In the conclusion of that work, the author suggests that failure detection should be a component of the operating system. Most work done after this tries to follow that suggestion.

All failure detection algorithms/schemes use time as a means to identify failure. There are different protocols/ways of using this tool. They vary in how the timeout should be handled, when and where messages should be sent and how often, and whether communication is synchronous or asynchronous. For small distributed networks, such as LANs, coordination and failure detection are simple and do not require much complexity.

= Protocols =

All failure detection mechanisms use time as a means to identify failure. The two best-known ones are the push and pull strategies. In [3], the authors also introduce a combination of both push and pull.

== The Push Model ==

Assume that we have two processes, p and q, with p being the monitor. Process q sends a heartbeat message every t seconds, hence process p expects an “I am alive” message from q every t seconds. Heartbeat messages are messages sent on a regular basis to report that a process is alive. If p does not receive any message from q within a timeout period T, it starts suspecting q [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:fig1.png|400px|thumb|center|Push Model]]

As the figure shows, process q sends heartbeat messages periodically. At some point it crashes and stops sending messages. Process p, waiting for heartbeat messages, no longer receives any and, after a timeout period of T, suspects that q has failed.
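The monitor's side of the push model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `PushMonitor`, its method names, and the use of explicit timestamps instead of a real clock are hypothetical and do not come from [3] or [4].

```python
class PushMonitor:
    """Monitor p in the push model: q sends heartbeats, p only listens."""

    def __init__(self, timeout, start=0.0):
        self.timeout = timeout        # T: how long p waits before suspecting q
        self.last_heartbeat = start   # arrival time of the latest heartbeat from q

    def on_heartbeat(self, now):
        # Called whenever an "I am alive" message from q arrives.
        self.last_heartbeat = now

    def suspects(self, now):
        # p suspects q once more than T has passed since the last heartbeat.
        return now - self.last_heartbeat > self.timeout


monitor = PushMonitor(timeout=3.0)
monitor.on_heartbeat(1.0)
monitor.on_heartbeat(2.0)          # q crashes some time after this heartbeat
print(monitor.suspects(4.0))       # still within T of the last heartbeat
print(monitor.suspects(5.5))       # more than T has elapsed
```

Passing the time in as an argument keeps the sketch deterministic; a real monitor would read a clock and would have to choose T with message delays in mind.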

== The Pull Model ==

Assume that we have the same parameters as above. In this model, instead of process q (the one that has to prove it is alive) sending messages to p every few seconds, p sends ‘are you alive?’ messages to q every t seconds and waits for q to respond ‘yes I am alive’ [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:Fig2.png|400px|thumb|center|The Pull Model]]

In the above figure, p repeatedly checks whether q is alive, and if q is alive it responds with the 'I am alive' heartbeat message. Once q crashes, p no longer receives responses from q and starts suspecting that q has failed after a timeout period T.
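A comparable sketch for the pull model, again with hypothetical names (`PullMonitor`, `send_query`, `on_reply`) and explicit timestamps in place of a real clock and network. The assumption made here is that p measures the timeout T from its oldest unanswered query.

```python
class PullMonitor:
    """Monitor p in the pull model: p asks, q answers."""

    def __init__(self, timeout):
        self.timeout = timeout       # T: how long p waits for an answer
        self.pending_since = None    # send time of the oldest unanswered query

    def send_query(self, now):
        # p sends "are you alive?"; remember when the oldest open query went out.
        if self.pending_since is None:
            self.pending_since = now
        return "are you alive?"

    def on_reply(self, now):
        # q answered "yes I am alive", so nothing is outstanding any more.
        self.pending_since = None

    def suspects(self, now):
        # p suspects q when a query has gone unanswered for longer than T.
        return (self.pending_since is not None
                and now - self.pending_since > self.timeout)


monitor = PullMonitor(timeout=2.0)
monitor.send_query(0.0)
monitor.on_reply(0.5)              # q is alive and answers
monitor.send_query(10.0)           # q has crashed; no reply arrives
print(monitor.suspects(11.5))      # the query is not yet overdue
print(monitor.suspects(12.5))      # unanswered for longer than T
```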

== The Dual Scheme ==

The pull model is somewhat inefficient, as potentially too many messages are sent between the processes. The push model is a bit more efficient in this respect. Hence, in [4] the authors propose a model that is a mix of the two. During the first message-sending phase, any process q monitored by p is assumed to use the push model, and hence to send “I am alive” or “liveness” messages to p. After some time, p assumes that any process q that did not send a liveness message is using the pull model. The figure below appears only in [4].
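One round of the two-phase scheme can be written as plain set arithmetic. This is a rough sketch, not the protocol as specified in the cited papers: the function name `dual_round` and its arguments are hypothetical, and the rule that a process is suspected when it stays silent in both phases is an assumption.

```python
def dual_round(monitored, pushed, replied):
    """Return (processes to query in phase 2, processes suspected after the round).

    monitored: all processes watched by p
    pushed:    processes whose liveness message arrived during phase 1
    replied:   processes that answered p's "are you alive?" query in phase 2
    """
    # Phase 1: anything that pushed a liveness message is treated as alive.
    to_query = monitored - pushed
    # Phase 2: p pulls from the silent processes; those still silent are suspected.
    suspected = to_query - replied
    return to_query, suspected


to_query, suspected = dual_round(
    monitored={"q1", "q2", "q3"},
    pushed={"q1"},
    replied={"q2"},
)
print(sorted(to_query))    # processes that did not push in phase 1
print(sorted(suspected))   # processes silent in both phases
```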

= More Complex Methods =

Although the above-mentioned protocols can help us detect failures, they do not scale well in large distributed systems. This is mostly due to bandwidth limitations, message overload, and similar factors; hence scalability is the major concern [3]. There have been various approaches to solving this problem. In my review, I touch upon the three that I investigated, which are:

1) The Hierarchical Method
2) Gossip-style Failure Detection
3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is appropriate for a LAN. In [3], the authors introduce three different elements in the hierarchical model (in summary): 1) clients, 2) monitors, and 3) monitorable objects. I illustrate this with the figure below from [3]; note that I redrew the figure myself to avoid any copyright issues.

In the figure, FD1, FD2, FD3 and FD3’ are all monitors/failure detectors. As you can see, the monitors are not all in the same LAN; however, each monitor only monitors monitorable objects in its own LAN. While monitoring the monitorables, the monitors also notify the clients of the status of the objects they need to know about. This model greatly reduces the number of messages exchanged between processes. Instead of the clients monitoring every monitorable object, monitors do so and notify the clients when necessary or when they ask. On top of that, monitors also ask other monitors about their monitorable objects, so they do not need to communicate with every other monitorable object. Again, this reduces the number of heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fails, the clients will not notice, as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems to be a good choice for a few LANs working together; however, it still does not solve our problem for larger-scale networks, such as WANs or work distributed over the Internet.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. Gossiping in distributed systems first appeared in [8, cited in 6].

Gossiping dynamically exchanges information among peers. Each peer has a cache/list of other peers. Traditionally, gossiping in distributed systems has been used to disseminate information to other peers [6]. Most gossiping approaches have the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], the authors propose a gossip-style failure detection mechanism that we will look at here.

In gossips, a member sends information to randomly chosen members [2]. Their gossip protocol, like gossip protocols in general, combines the efficiency of hierarchical dissemination with the robustness of flooding protocols [2]. The protocol gossips to figure out who else is still gossiping.

Each member maintains a list of known members, with each member's address and a heartbeat counter that is the basis for judging failure. Each entry in the list maps a member to its heartbeat counter. Every Tg seconds, each member increments its own heartbeat and selects members to send its list to. Receiving members then merge the arriving list with their own, adopting the maximum heartbeat for each member in the list. Each member also keeps track of the last time each member's heartbeat was incremented. If the heartbeat of a member has not increased in Tf (T fail) seconds, then that member is suspected of having failed. However, an update may take some time to reach a member, and one member may have passed Tf for another member while others have not necessarily done so. The authors therefore introduce another time variable, Tc, such that once this time has passed, members can have greater confidence in the failure of a suspected node.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

In brief, the core of the algorithm is as follows. Each member keeps a list of the other members, each with a heartbeat counter. Once every few seconds, it increments its own heartbeat and gossips its updated list to randomly chosen members. The receiving member merges the incoming list with its own. If a given member's heartbeat has not been incremented within an adequate time, that member is suspected of having failed.
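The list-merging and timeout rules summarized above can be sketched as follows. This is a minimal illustration, not the implementation from [2]: the class `GossipMember`, its method names, and the dictionary layout are assumptions, time is passed in explicitly, each round gossips to a single random peer, and the second timer Tc is left out.

```python
import random


class GossipMember:
    """One member of a gossip-style failure detection group (simplified)."""

    def __init__(self, name, t_fail):
        self.name = name
        self.t_fail = t_fail  # Tf: seconds without an increase before suspecting
        # member name -> [heartbeat counter, local time the counter last increased]
        self.table = {name: [0, 0.0]}

    def tick(self, now, peers, rng=random):
        """One gossip round: bump own heartbeat, pick a random peer, build the list."""
        entry = self.table[self.name]
        entry[0] += 1
        entry[1] = now
        snapshot = {member: hb for member, (hb, _) in self.table.items()}
        target = rng.choice(peers) if peers else None
        return target, snapshot

    def merge(self, incoming, now):
        """Adopt the maximum heartbeat per member and note when it increased."""
        for member, hb in incoming.items():
            known = self.table.get(member)
            if known is None or hb > known[0]:
                self.table[member] = [hb, now]

    def suspected(self, now):
        """Members whose heartbeat has not increased for more than Tf seconds."""
        return sorted(member for member, (_, seen) in self.table.items()
                      if member != self.name and now - seen > self.t_fail)


a = GossipMember("a", t_fail=5.0)
b = GossipMember("b", t_fail=5.0)
target, gossip = b.tick(now=1.0, peers=["a"])
a.merge(gossip, now=1.0)           # a learns that b's heartbeat is 1
print(a.suspected(now=4.0))        # b's heartbeat is still fresh
print(a.suspected(now=7.0))        # no increase for more than Tf seconds
```

Because the merge keeps the maximum counter, stale lists arriving late cannot reset a member's freshness; only a genuinely higher heartbeat does.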

The paper then analyzes the proposed scheme, calculating failure detection time in different scenarios and varying the parameters. The authors examine detection time against the number of failed members, the number of mistakes, and the probability of message loss. They also investigate which values to choose for Tc and Tf with respect to the number of members, the expected failure rate, the expected message loss, and so on.

=== Expanding the Gossip (Multi-level Gossiping) ===

Their scheme so far works well within a single subnet. They extend it to a multi-level gossiping scheme. To avoid using too much bandwidth, most gossiping is done within the same subnet, with few gossip messages sent between subnets and even fewer between domains. The protocol aims to have, on average, one member per subnet gossip to a member of another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One out of n times, where n is the size of the subnet, it picks a random subnet within its domain and a random host in that subnet to gossip to. The member then tosses another coin to choose another domain.
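The weighted coin can be checked with a short simulation: if each of the n members of a subnet gossips outside the subnet with probability 1/n, the expected number of cross-subnet gossips per round is n × (1/n) = 1. The sketch below covers only this subnet-level coin; the function names are hypothetical and the domain-level coin is omitted.

```python
import random


def pick_scope(subnet_size, rng=random):
    """Weighted coin: 'other-subnet' with probability 1/subnet_size."""
    if rng.random() < 1.0 / subnet_size:
        return "other-subnet"
    return "own-subnet"


def average_cross_subnet_gossips(subnet_size, rounds, rng):
    """Average number of members per round that gossip outside their subnet."""
    total = 0
    for _ in range(rounds):
        for _ in range(subnet_size):
            if pick_scope(subnet_size, rng) == "other-subnet":
                total += 1
    return total / rounds


rng = random.Random(0)  # fixed seed so the run is repeatable
print(average_cross_subnet_gossips(subnet_size=20, rounds=2000, rng=rng))
```

The printed average should land close to 1, which matches the stated goal of roughly one cross-subnet gossip per subnet per round regardless of the subnet's size.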

In [3], the authors draw a diagram that combines the hierarchical model with the gossip model of [2]. The figure is as follows:

Other work has also been done on gossip-style failure detection [15].

== The Accrual Failure Detector ==

In [5], N. Hayashibara et al. present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously with classical failure detectors. They say that a failure detection service must maintain a certain level of quality of service across different requirements while still allowing each application to tune the service to its needs. They introduce accrual failure detectors, which are not based on a Boolean output in which a process is either a) correct/working or b) suspected.

In the accrual model, a monitor outputs a value on a continuous scale rather than a Boolean. In this protocol, it is the level of confidence in a suspicion that changes. A process’s level of suspicion increases over time as its heartbeats fail to arrive on time. The protocol samples the heartbeats coming from different hosts, analyzes the samples, and uses that information to predict the pattern that the next heartbeats will follow.

They use a function susp_level_p(t) >= 0. Its value grows as the suspicion level of a process p increases, decreases once p is back up, and should not change much while p is working.
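To make the shape of such a function concrete, here is a deliberately simplified stand-in. It is not the φ formula from [5]: as an assumption for illustration, the suspicion level is taken to be how overdue the next heartbeat is, measured in units of the mean interval between sampled heartbeats. The class and method names are hypothetical.

```python
class AccrualMonitor:
    """Outputs a non-negative suspicion level instead of a Boolean verdict."""

    def __init__(self):
        self.intervals = []  # sampled gaps between consecutive heartbeats
        self.last = None     # arrival time of the most recent heartbeat

    def on_heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def susp_level(self, now):
        # Without at least two heartbeats there is no history to predict from.
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        # Zero while the next heartbeat is not yet due, then growing with the delay.
        overdue = (now - self.last) - mean
        return max(0.0, overdue / mean)


monitor = AccrualMonitor()
for arrival in (0.0, 1.0, 2.0, 3.0):
    monitor.on_heartbeat(arrival)
print(monitor.susp_level(3.5))   # next heartbeat not yet due
print(monitor.susp_level(5.0))   # one mean interval overdue
print(monitor.susp_level(8.0))   # suspicion keeps growing
monitor.on_heartbeat(8.0)        # the process is back
print(monitor.susp_level(8.0))   # suspicion drops again
```

An application can then pick its own threshold on this value, which is the tuning freedom that the accrual approach is meant to provide.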

Further work has been done in this area, such as [10].

= Further Work =

Other work has been done in this field that I did not get a chance to analyze deeply. Such work includes the quality of service of failure detectors [9].

There has also been work on dynamic distributed systems. In [11], N. Sridhar presents local failure detectors that can tolerate mobility and topology changes.

There is also work on distributed wireless networks [14].

= Conclusion =

In this literature review, I discussed some past work, introduced some protocols that processes use to detect failures, and then extended the discussion to wide area networks with the more sophisticated protocols and methods that I learnt about while reading different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. Van Renesse, Y. Minsky, and M. Hayden. A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, and T. Katayama. Failure detectors for large-scale distributed systems. In 21st IEEE Symposium on Reliable Distributed Systems (SRDS ’02), 2002.

[5] N. Hayashibara, X. Defago, R. Yared, and T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS ’04), pages 55-75, 2004.

[6] A. Kermarrec and M. van Steen. Gossiping in distributed systems. ACM SIGOPS Operating Systems Review, 2007.

[7] M. Fischer, N. Lynch, and M. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 1985.

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In Proceedings of the Sixth Symposium on Principles of Distributed Computing, pages 1-12, August 1987. ACM.

[9] W. Chen, S. Toueg, and M. K. Aguilera. On the quality of service of failure detectors. IEEE Transactions on Computers, 2002.

[10] N. Hayashibara and M. Takizawa. Design of a notification system for the accrual failure detector. In 20th International Conference on Advanced Information Networking and Applications, Volume 1 (AINA ’06), 2006.

[11] N. Sridhar. Decentralized local failure detection in dynamic distributed systems. In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems, 2006.

[12] T. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 1996.

[13] T. Chandra, V. Hadzilacos, and S. Toueg. The weakest failure detector for solving consensus. Journal of the ACM (JACM), 1996.

[14] J. Chen, S. Kher, and A. Somani. Distributed fault detection of wireless sensor networks. In Proceedings of the 2006 Workshop on Dependability Issues in Wireless Ad Hoc Networks and Sensor Networks (DIWANS ’06), 2006.

[15] S. Ranganathan, A. George, R. Todd, and M. Chidester. Gossip-style failure detection and distributed consensus for scalable heterogeneous clusters. Cluster Computing, 2001.

DistOS-2011W Failure Detection in Distributed Systems

2011-03-09T00:06:16Z

Hadi sajjadpour: /* The Pull Model */

= Abstract =

Failure detection has been studied for some time now. It is valuable to distributed systems because it adds to their reliability and increases their usefulness. Hence, it is important for distributed systems to be able to detect failures efficiently and accurately when they occur, and to cater to them [3].

= Introduction =

In this literature review, I first briefly cover the history of failure detectors in distributed systems, then define the best-known protocols used for monitoring between local processes, then introduce their drawbacks, and finally mention some work done to address those drawbacks.

Most of what I learnt in this task came from papers [3] and [4], as they categorize everything beautifully.

= History =

Some work done in the 90’s was related to solving the consensus problem. Consensus roughly means processors or units agreeing on a common decision despite failures [12].

In [7], M. Fischer et al. prove that fault-tolerant cooperative computing (consensus) cannot be solved in a totally asynchronous model. Asynchronous distributed systems are ones in which message transmission delays are unbounded [12, 13]. They conclude that models are needed that take more realistic approaches, based on assumptions about processors and communication timings. However, Chandra and Toueg [12, cited in 4] mention that augmenting the asynchronous system model with failure detectors makes it possible to bypass the impossibility result in [7].

In [1], the author suggests that when an independent failure management system investigates a service that is not responding, it should contact the operating system running that service to obtain further confidence. The independent failure detector should have the following three functional modules:
“
1. A library that implements simple failure management functionality and provide the API to the complete service.
2. A service implementing per node failure management, combining fault management with other local nodes to exploit locality of communication and failure patterns.
3. An inquiry service closely coupled with the operating system which, upon request, provides information about the state of local participating processes.” [1]

In the conclusion of that work, the author suggests that failure detection should be a component of the operating system. Most work done after this tries to follow that suggestion.

All failure detection algorithms/schemes use time as a means to identify failure. There are different protocols/ways of using this tool. They vary in how the timeout should be handled, when and where messages should be sent and how often, and whether communication is synchronous or asynchronous. For small distributed networks, such as LANs, coordination and failure detection are simple and do not require much complexity.

= Protocols =

All failure detection mechanisms use time as a means to identify failure. The two best-known ones are the push and pull strategies. In [3], the authors also introduce a combination of both push and pull.

== The Push Model ==

Assume that we have two processes, p and q, with p being the monitor. Process q sends a heartbeat message every t seconds, hence process p expects an “I am alive” message from q every t seconds. Heartbeat messages are messages sent on a regular basis to report that a process is alive. If p does not receive any message from q within a timeout period T, it starts suspecting q [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:fig1.png|400px|thumb|center|Push Model]]

As the figure shows, process q sends heartbeat messages periodically. At some point it crashes and stops sending messages. Process p, waiting for heartbeat messages, no longer receives any and, after a timeout period of T, suspects that q has failed.

== The Pull Model ==

Assume that we have the same parameters as above. In this model, instead of process q (the one that has to prove it is alive) sending messages to p every few seconds, p sends ‘are you alive?’ messages to q every t seconds and waits for q to respond ‘yes I am alive’ [3,4]. The figure below is used in both [3] and [4], with a minute difference in each.

[[File:Fig2.png|400px|thumb|center|The Pull Model]]

In the above figure, p repeatedly checks whether q is alive, and if q is alive it responds with the 'I am alive' heartbeat message. Once q crashes, p no longer receives responses from q and starts suspecting that q has failed after a timeout period T.

== The Dual Scheme ==

The pull model is somewhat inefficient, as potentially too many messages are sent between the processes. The push model is a bit more efficient in this respect. Hence, in [4] the authors propose a model that is a mix of the two. During the first message-sending phase, any process q monitored by p is assumed to use the push model, and hence to send “I am alive” or “liveness” messages to p. After some time, p assumes that any process q that did not send a liveness message is using the pull model. The figure below appears only in [4].

= More Complex Methods =

Although the above-mentioned protocols can help us detect failures, they do not scale well in large distributed systems. This is mostly due to bandwidth limitations, message overload, and similar factors; hence scalability is the major concern [3]. There have been various approaches to solving this problem. In my review, I touch upon the three that I investigated, which are:

1) The Hierarchical Method
2) Gossip-style Failure Detection
3) The Accrual Failure Detector

== The Hierarchical Model ==

This model is appropriate for a LAN. In [3], the authors introduce three different elements in the hierarchical model (in summary): 1) clients, 2) monitors, and 3) monitorable objects. I illustrate this with the figure below from [3]; note that I redrew the figure myself to avoid any copyright issues.

In the figure, FD1, FD2, FD3 and FD3’ are all monitors/failure detectors. As you can see, the monitors are not all in the same LAN; however, each monitor only monitors monitorable objects in its own LAN. While monitoring the monitorables, the monitors also notify the clients of the status of the objects they need to know about. This model greatly reduces the number of messages exchanged between processes. Instead of the clients monitoring every monitorable object, monitors do so and notify the clients when necessary or when they ask. On top of that, monitors also ask other monitors about their monitorable objects, so they do not need to communicate with every other monitorable object. Again, this reduces the number of heartbeat messages exchanged. In the given example, even if FD3 or FD3’ fails, the clients will not notice, as messages can be routed through another path.

== Gossip-Style Failure Detection ==

The hierarchical configuration seems to be a good choice for a few LANs working together; however, it still does not solve our problem for larger-scale networks, such as WANs or work distributed over the Internet.

Gossiping in distributed systems is the repeated probabilistic exchange of information between two members [6]. Gossiping in distributed systems first appeared in [8, cited in 6].

Gossiping dynamically exchanges information among peers. Each peer has a cache/list of other peers. Traditionally, gossiping in distributed systems has been used to disseminate information to other peers [6]. Most gossiping approaches have the following three tasks:

1) Peer Selection: In this task, each peer must choose some peers to send data to. This could be on a schedule, or it could be done randomly.

2) Data Exchanged: The peer sending the data must select some data to send to the peers it has chosen.

3) Data Processing: The peer at the receiving end decides what to do with the data sent from other peers [6].

In our case, we are interested in the use of gossiping for failure detection. In [2], the authors propose a gossip-style failure detection mechanism that we will look at here.

In gossips, a member sends information to randomly chosen members [2]. Their gossip protocol, like gossip protocols in general, combines the efficiency of hierarchical dissemination with the robustness of flooding protocols [2]. The protocol gossips to figure out who else is still gossiping.

Each member maintains a list of known members, with each member's address and a heartbeat counter that is the basis for judging failure. Each entry in the list maps a member to its heartbeat counter. Every Tg seconds, each member increments its own heartbeat and selects members to send its list to. Receiving members then merge the arriving list with their own, adopting the maximum heartbeat for each member in the list. Each member also keeps track of the last time each member's heartbeat was incremented. If the heartbeat of a member has not increased in Tf (T fail) seconds, then that member is suspected of having failed. However, an update may take some time to reach a member, and one member may have passed Tf for another member while others have not necessarily done so. The authors therefore introduce another time variable, Tc, such that once this time has passed, members can have greater confidence in the failure of a suspected node.

Each member gossips at regular intervals, but these intervals are not synchronized with each other.

In brief, the core of the algorithm is as follows. Each member keeps a list of the other members, each with a heartbeat counter. Once every few seconds, it increments its own heartbeat and gossips its updated list to randomly chosen members. The receiving member merges the incoming list with its own. If a given member's heartbeat has not been incremented within an adequate time, that member is suspected of having failed.

The paper then analyzes the proposed scheme, calculating failure detection time in different scenarios and varying the parameters. The authors examine detection time against the number of failed members, the number of mistakes, and the probability of message loss. They also investigate which values to choose for Tc and Tf with respect to the number of members, the expected failure rate, the expected message loss, and so on.

=== Expanding the Gossip (Multi-level Gossiping) ===

Their scheme so far works well within a single subnet. They extend it to a multi-level gossiping scheme. To avoid using too much bandwidth, most gossiping is done within the same subnet, with few gossip messages sent between subnets and even fewer between domains. The protocol aims to have, on average, one member per subnet gossip to a member of another subnet in each round. To achieve this, every member tosses a weighted coin every time it gossips. One out of n times, where n is the size of the subnet, it picks a random subnet within its domain and a random host in that subnet to gossip to. The member then tosses another coin to choose another domain.

In [3], the authors draw a diagram that combines the hierarchical model with the gossip model of [2]. The figure is as follows:

Other work has also been done on gossip-style failure detection [15].

== The Accrual Failure Detector ==

In their paper N. Hayashibara et al. in [5] present a new approach to failure detection. They argue that it is difficult to satisfy several application requirements simultaneously while using classical failure detectors. They say that maintaining a certain level of quality-of-service on different requirements and at the same time performing failure detection must still allow tuning of services to meet their needs. They introduce the accrual failure detectors that are not based on a Boolean; either a process is a) correct/working b) suspected.

In the accrual model, a monitor outputs a value on a continuous scale rather than a Boolean. In this protocol, what changes is the level of confidence in a suspicion. A process's level of suspicion increases over time as its heartbeats fail to arrive on time. The protocol samples the heartbeats coming from different hosts, analyzes them, and uses that information to predict what pattern the next heartbeats will follow.

They use a function susp_level_p(t) >= 0 that gives the suspicion level of a process p at time t. Its value grows while p's heartbeats are overdue, decreases if p comes back up, and should not change much while p is working.
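
One way to obtain such a function is sketched below. It is only an illustration under stated assumptions: the class AccrualDetector, its window size and the floor on the standard deviation are hypothetical, and I assume heartbeat inter-arrival times are roughly normally distributed, so the suspicion level is taken as the negative base-10 logarithm of the probability that a heartbeat would still arrive this late.

```python
import math
from collections import deque


class AccrualDetector:
    def __init__(self, window=100):
        # Recent heartbeat inter-arrival times for the monitored process p.
        self.intervals = deque(maxlen=window)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record the arrival of a heartbeat from p at local time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def susp_level(self, now):
        """Suspicion level of p at time `now`; always >= 0."""
        if len(self.intervals) < 2:
            return 0.0  # not enough samples to estimate the arrival pattern
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-3)  # floor avoids division by zero (sketch assumption)
        elapsed = now - self.last_arrival
        # Probability that a heartbeat arrives later than `elapsed` under the normal model.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # keep the logarithm finite
        return max(0.0, -math.log10(p_later))
```

An application can then compare susp_level against its own threshold, which is how the accrual model lets different applications tune the same detector to their needs.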

Further work has been done in this area, such as [10].

= Further Work =

There has also been other work in this field that I did not get a chance to analyze deeply. Such work includes the quality of service of failure detectors [9].

There has also been work on dynamic distributed systems. N. Sridhar, in his paper [11], presents local failure detectors that can tolerate mobility and topology changes.

There is also work on fault detection in distributed wireless networks [14].

= Conclusion =

In this literature review, I discussed some earlier work on failure detection, introduced some protocols that processes use to detect failures, and then expanded to wider-area networks with the more sophisticated protocols and methods that I learned about while reading the different papers.

= References =

[1] W. Vogels. World wide failures. In Proceedings of the ACM SIGOPS 1995 European Workshop, 1995.

[2] R. van Renesse, Y. Minsky, M. Hayden. A gossip-style failure detection service. In Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 1998. The version used here is from 2007.

[3] P. Felber, X. Defago, R. Guerraoui, P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, 1999. The version I used here is from 2002.

[4] N. Hayashibara, A. Cherif, T. Katayama. Failure detectors for large-scale distributed systems. In Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems (SRDS '02), 2002.

[5] N. Hayashibara, X. Defago, R. Yared, T. Katayama. The φ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS '04), pages 55-75, 2004.

[6] A. Kermarrec, M. van Steen. Gossiping in distributed systems. ACM SIGOPS Operating Systems Review, 2007.

[7] M. J. Fischer, N. Lynch, M. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 1985.

[8] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, D. Terry. Epidemic algorithms for replicated database maintenance. In Proceedings of the Sixth Symposium on Principles of Distributed Computing, pages 1-12, ACM, August 1987.

[9] W. Chen, S. Toueg, M. K. Aguilera. On the quality of service of failure detectors. IEEE Transactions on Computers, 2002.

[10] N. Hayashibara, M. Takizawa. Design of a notification system for the accrual failure detector. In 20th International Conference on Advanced Information Networking and Applications, volume 1 (AINA '06), 2006.

[11] N. Sridhar. Decentralized local failure detection in dynamic distributed systems. In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems, 2006.

[12] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 1996.

[13] T. Chandra, V. Hadzilacos, S. Toueg. The weakest failure detector for solving consensus. Journal of the ACM (JACM), 1996.

[14] J. Chen, S. Kher, A. Somani. Distributed fault detection of wireless sensor networks. In Proceedings of the 2006 Workshop on Dependability Issues in Wireless Ad Hoc Networks and Sensor Networks (DIWANS '06), 2006.

[15] S. Ranganathan, A. George, R. Todd, M. Chidester. Gossip-style failure detection and distributed consensus for scalable heterogeneous clusters. Cluster Computing, 2001.

File:Fig2.png

2011-03-09T00:05:28Z

Hadi sajjadpour: