Campus Event Calendar: Florin Dinu (04/23/2013 in E1 5/029)

Event Entry

What and Who

Understanding and Improving the Efficiency of Failure Resilience for Big Data Frameworks

Florin Dinu

Rice University

SWS Colloquium

Florin Dinu is a final-year graduate student in the Systems Group at Rice University, Houston, TX, advised by Prof. T. S. Eugene Ng. Before joining Rice in 2007, he received a B.A. in Computer Science from Politehnica University Bucharest in 2006 and then worked as a junior researcher at the Fokus Fraunhofer Institute in Berlin, Germany. His Ph.D. dissertation focuses on the efficiency of failure resilience in big data processing frameworks. He has also worked on the benefits of centralized network control, congestion inference, and improving data transfers for big data computations.

SWS

AG Audience

English


Date, Time and Location

Tuesday, 23 April 2013

10:30

90 Minutes

E1 5

029

Saarbrücken

Abstract

Big data processing frameworks (MapReduce, Hadoop, Dryad) are hugely popular today. A strong selling point is their ability to provide failure resilience guarantees: they can run computations to completion despite occasional failures in the system. However, the efficiency of the failure resilience they provide has been overlooked. The vision of this work is that big data frameworks should not only finish computations under failures but also minimize the impact of those failures on computation time.

The first part of the talk presents the first in-depth analysis of the efficiency of the failure resilience provided by the popular Hadoop framework at the level of a single job. The results show that compute node failures can lead to variable and unpredictable job running times. The causes behind these results are detailed in the talk.

The second part of the talk focuses on providing failure resilience at the level of multi-job computations. It presents the design, implementation and evaluation of RCMP, a MapReduce system based on the fundamental insight that using replication as the main failure resilience strategy often leads to significant and unnecessary increases in computation running time. In contrast, RCMP is designed to use job re-computation as a first-order failure resilience strategy. RCMP enables re-computations that perform the minimum amount of work, and it maximizes the efficiency of the re-computation work that still needs to be performed.

Contact

Claudia Richter

9303 9103


Video Broadcast

Video Broadcast:

Yes

To Location:

Kaiserslautern

To Building:

G26

To Room:

206

Claudia Richter, 04/17/2013 09:28 -- Created document.
