Measuring the Effects of Internet Path Faults on Reactive Routing

Nick Feamster, David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek
ACM SIGMETRICS, San Diego, CA, June 2003.

Empirical evidence suggests that reactive routing systems improve resilience to Internet path failures. They detect and route around faulty paths based on measurements of path performance. This paper seeks to understand why and under what circumstances these techniques are effective.

To do so, this paper correlates end-to-end active probing experiments, loss-triggered traceroutes of Internet paths, and BGP routing messages. These correlations shed light on three questions about Internet path failures: (1) Where do failures appear? (2) How long do they last? (3) How do they correlate with BGP routing instability?
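
As a rough illustration of what correlating failures with BGP messages can mean in practice, the sketch below pairs each observed failure time with the nearest BGP message time and reports the signed lag. It is not the paper's analysis code. The function name `nearest_bgp_lag`, the timestamps-in-seconds representation, and the 900-second matching window are all assumptions made for this example.

```python
from bisect import bisect_left


def nearest_bgp_lag(failure_times, bgp_times, window=900.0):
    """Return, for each failure, the signed lag in seconds to the nearest
    BGP message, or None if no message falls within `window` seconds.

    A positive lag means the failure came before the BGP message.
    All names and the default window are hypothetical.
    """
    bgp_sorted = sorted(bgp_times)
    lags = []
    for t in failure_times:
        # Index of the first BGP message at or after the failure.
        i = bisect_left(bgp_sorted, t)
        # Only the neighbours on either side of t can be the nearest.
        candidates = bgp_sorted[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda b: abs(b - t), default=None)
        if best is None or abs(best - t) > window:
            lags.append(None)
        else:
            lags.append(best - t)
    return lags


if __name__ == "__main__":
    failures = [1000, 5000, 20000]   # made-up failure times (seconds)
    bgp = [1240, 4900, 9000]         # made-up BGP message times (seconds)
    print(nearest_bgp_lag(failures, bgp))
```

In this made-up input, the first failure is followed by a BGP message 240 seconds later, the second is preceded by one 100 seconds earlier, and the third has no BGP message within the window.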

Data collected over 13 months from an Internet testbed of 31 topologically diverse hosts suggest that most path failures last less than fifteen minutes. Failures that appear in the network core correlate better with BGP instability than failures that appear close to end hosts. Most failures precede BGP messages by about four minutes on average, but there is often increased BGP traffic both before and after failures. Our findings suggest that reactive routing is most effective between hosts that have multiple connections to the Internet. The data set also suggests that passive observations of BGP routing messages could be used to predict about 20% of impending failures, allowing re-routing systems to react to them more quickly.
