Sunday, February 26, 2012

Designing Distributed Database Systems for Efficient Operation

Designing Distributed Database Systems for Efficient Operation

Sangkyu Rho, Salvatore March

Distributed database systems can yield significant cost and performance advantages over centralized systems for geographically distributed organizations. The efficiency o f a distributed database depends primarily on the data allocation (data replication and placement) and the operating strategies (where and h ow retrieval and update query processing operations are performed). W e develop a distributed database design approach that comprehensively treats data allocation and operating strategies, explicitly modeling their interdependencies for both retrieval and update processing. W e demonstrate that data replication, join node selection, and data reduction b y semijoin are important design and operating decisions that have significant impact on both the cost and response time of a distributed database system.

Tuesday, February 21, 2012

On the impact of network latency on distributed systems design

link: http://www.springerlink.com/content/v8x37536u2343u86/fulltext.pdf

Research in distributed database systems to date has assumed a “variable cost” model of network response time. However, network response time has two components: transmission time (variable with message size) and latency (fixed). This research improves on existing models by incorporating a “fixed plus variable cost” model of the network response time. In this research, we: (1) develop a distributed database design approach that incorporates a “fixed plus variable cost”, network response time function; (2) run a set of experiments to create designs using this model, and (3) evaluate the impact the new model had on the design in various types of networks.

This is a followup paper by the same author:
Modeling Network Latency and Parallel Processing in Distributed Database Design

Sunday, February 19, 2012

Thialfi: A Client Notification Service for Internet-Scale Applications

http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/10-adya.pdf

Abstract. Ensuring the freshness of client data is a fundamental problem for
applications that rely on cloud infrastructure to store data and mediate sharing. Thialfi is a notification service developed at Google to simplify this task. Thialfi supports applications written in multiple programming languages and running on multiple platforms, e.g., browsers, phones, and desktops. Applications register their interest in a set of shared objects and receive notifications when those objects change. Thialfi servers run in multiple Google data centers for availability and replicate their state asynchronously. Thialfi’s approach to recovery emphasizes simplicity: all server state is soft, and clients drive recovery and assist in replication. A principal goal of our design is to provide a straightforward API and good semantics despite a variety of failures, including server crashes, communication failures, storage unavailability, and data center failures.

Some notes: Version updates come from the application server and as a result the application server should maintain the new version number for the objects it shares with the clients. This implies that Thialfi imposes a requirement for versioning objects whereas in many cases versioning may not be even needed.

They consider only 10% of the clients to be online most of the time and measurements are taken for cases where the number of clients is not really high, an increase in the number of clients would change the metrics (drastically?).

Friday, February 17, 2012

Some interesting graph cut papers

Notes on graph cuts with submodular edge weights
http://users.cms.caltech.edu/~krausea/discml/papers/jegelka09subcuts.pdf

Optimization on Graphs with Variable
http://www.springerlink.com/content/g457x624814gm812/fulltext.pdf

Labelings of Graphs with Fixed and Variable Edge-Weights
http://epubs.siam.org/sidma/resource/1/sjdmec/v21/i3/p688_s1

Dynamic Graph Cuts for Efficient Inference in Markov Random Fields
http://research.microsoft.com/en-us/um/people/pkohli/papers/pami07.pdf

Some thoughts on Replicability

Replicability
  • Horizontal Partitioning is where cloud is more significantly helpful so if you manage to determine stateless components or come up with recommendations for stateless components, it is easier to identify whether or not we can do any horizontal partitioning without significant effects and modifications to the application.
  • Data is an issue, so let's collocated code and data

Graph Partitioning with Natural Cuts

link: http://research.microsoft.com/pubs/142349/punchtr.pdf
Daniel Delling, Andrew GOldberg, Ilya Razenshteyn, Renato F. Weneck

Abstract. We present a novel approach to graph partitioning based on the notion of natural
cuts. Our algorithm, called PUNCH, has two phases. The rst phase performs a
series of minimum-cut computations to identify and contract dense regions of the
graph. This reduces the graph size signi cantly, but preserves its general structure.
The second phase uses a combination of greedy and local search heuristics to assemble
the nal partition. The algorithm performs especially well on road networks, which
have an abundance of natural cuts (such as bridges, mountain passes, and ferries).
In a few minutes, it obtains the best known partitions for continental-sized networks,
signi cantly improving on previous results.

Thursday, February 16, 2012

A New Approach to the Minimum Cut Problem

http://www.columbia.edu/~cs2035/courses/ieor6614.S09/Contrac
tion.pdf

Friday, February 10, 2012

An Evaluation of Alternative Architectures for Transaction Processing in the Cloud

Donald Kossmann, Tim Kraska, Simon Loesing SIGMOD2010

http://www.cs.berkeley.edu/~kraska/pub/sigmod10-cloudbench.pdf

Cloud computing promises a number of advantages for the deployment
of data-intensive applications. One important promise
is reduced cost with a pay-as-you-go business model. Another
promise is (virtually) unlimited throughput by adding servers if
the workload increases. This paper lists alternative architectures
to effect cloud computing for database applications and reports on
the results of a comprehensive evaluation of existing commercial
cloud services that have adopted these architectures. The focus of
this work is on transaction processing (i.e., read and update workloads),
rather than analytics or OLAP workloads, which have recently
gained a great deal of attention. The results are surprising
in several ways. Most importantly, it seems that all major vendors
have adopted a different architecture for their cloud services. As a
result, the cost and performance of the services vary significantly
depending on the workload.

Wednesday, February 08, 2012

To Move or Not to Move: the economics of Cloud Computing

To Move or Not to Move: The Economics of Cloud Computing
Byung Chul Tak, Bhuvan Urgaonkar, and Anand Sivasubramaniam, The Pennsylvania State University
HotCloud 2011

http://www.usenix.org/events/hotcloud11/tech/final_files/Tak.pdf


Cloud-based hosting promises cost advantages over conventional in-house (on-premise) application deployment. One important question when considering a move to the cloud is whether it makes sense for 'my' application to migrate to the cloud. This question is challenging to answer due to following reasons. Although many potential benefits of migrating to the cloud can be enumerated, some benefits may not apply to 'my' application. Also, there can be multiple ways in which an application might make use of the facilities offered by cloud providers. Answering these questions requires an in-depth understanding of the cost implications of all the possible choices specific to 'my' circumstances. In this study We identify an initial set of key factors affecting the costs of a deployement choice. Using benchmarks representing two different applications (TPC-W and TPC-E) we investigate the evolution of costs for different deployment choices. We show that application characteristics such as workload intensity, growth rate, storage capacity and software licensing costs produce complex combined effect on overall costs. We also discuss issues regarding workload variance and horizontal partitioning.