Max-Planck-Institut für Informatik
max planck institut
informatik
mpii logo Minerva of the Max Planck Society
 

MPI-INF or MPI-SWS or Local Campus Event Calendar

<< Previous Entry Next Entry >> New Event Entry Edit this Entry Login to DB (to update, delete)
What and Who
Title:Combinatorial Constructions for Effective Testing
Speaker:Filip Nikšic
coming from:Max Planck Institute for Software Systems
Speakers Bio:
Event Type:SWS Student Defense Talks - Thesis Defense
Visibility:SWS
We use this to send out email in the morning.
Level:Public Audience
Language:English
Date, Time and Location
Date:Friday, 3 May 2019
Time:15:30
Duration:75 Minutes
Location:Kaiserslautern
Building:G26
Room:111
Abstract
Large-scale distributed systems consist of a number of components, take a number of

parameter values as input, and behave differently based on a number of non-deterministic

events. All these features—components, parameter values, and events—interact in com-

plicated ways, and unanticipated interactions may lead to bugs. Empirically, many bugs

in these systems are caused by interactions of only a small number of features. In certain

cases, it may be possible to test all interactions of k features for a small constant k by

executing a family of tests that is exponentially or even doubly-exponentially smaller

than the family of all tests. Thus, in such cases we can effectively uncover all bugs that

require up to k-wise interactions of features.

In this thesis we study two occurrences of this phenomenon. First, many bugs in

distributed systems are caused by network partition faults. In most cases these bugs occur

due to two or three key nodes, such as leaders or replicas, not being able to communicate,

or because the leading node finds itself in a block of the partition without quorum.

Second, bugs may occur due to unexpected schedules (interleavings) of concurrent events—

concurrent exchange of messages and concurrent access to shared resources. Again, many

bugs depend only on the relative ordering of a small number of events. We call the smallest

number of events whose ordering causes a bug the depth of the bug. We show that in

both testing scenarios we can effectively uncover bugs involving small number of nodes

or bugs of small depth by executing small families of tests.

We phrase both testing scenarios in terms of an abstract framework of tests, testing

goals, and goal coverage. Sets of tests that cover all testing goals are called covering

families. We give a general construction that shows that whenever a random test covers

a fixed goal with sufficiently high probability, a small randomly chosen set of tests is

a covering family with high probability. We then introduce concrete coverage notions

relating to network partition faults and bugs of small depth. In case of network partition

faults, we show that for the introduced coverage notions we can find a lower bound on the

probability that a random test covers a given goal. Our general construction then yields a

randomized testing procedure that achieves full coverage—and hence, find bugs—quickly.

In case of coverage notions related to bugs of small depth, if the events in the program

form a non-trivial partial order, our general construction may give a suboptimal bound.

Thus, we study other ways of constructing convering families. We show that if the events

in a concurrent program are partially ordered as a tree, we can explicitly construct a

covering family of small size: for balanced trees, our construction is polylogarithmic in

the number of events. For the case when the partial order of events does not have a “nice”

structure, and the events and their relation to previous events are revealed while the

program is running, we give an online construction of covering families. Based on the

construction, we develop a randomized scheduler called PCTCP that uniformly samples

schedules from a covering family and has a rigorous guarantee of finding bugs of small

depth. We experiment with an implementation of PCTCP on two real-world distributed

systems—Zookeeper and Cassandra—and show that it can effectively find bugs.

Contact
Name(s):
Video Broadcast
Video Broadcast:YesTo Location:Saarbrücken
To Building:E1 5To Room:029
Meeting ID:SWS Space 2 (6312)
Tags, Category, Keywords and additional notes
Note:
Attachments, File(s):

Created:
Maria-Louise Albrecht/MPI-KLSB, 04/11/2019 03:00 PM
Last modified:
Maria-Louise Albrecht/MPI-KLSB, 04/11/2019 03:17 PM
  • Maria-Louise Albrecht, 04/11/2019 03:17 PM -- Created document.