Case Studies on Intra-Domain Routing Instability

Zhang Shu
National Institute of Information and Communications Technology, Japan

NANOG31, San Francisco, 2004/5/25

Overview

Intra-domain routing instability
Measurements of intra-domain routing instability
• WIDE Internet and APAN Tokyo-XP network
Dealing with intra-domain routing instability
• Detection and troubleshooting
Conclusions

Intra-Domain Routing Instability

Intra-domain routing instability
• Unexpected routing changes within an IGP routing domain
• Causes packet loss, increased router load, and wasted bandwidth
Why focus on intra-domain routing?
• Compared with inter-domain routing, research on IGP behavior is still scarce
• Help operators better understand intra-domain routing instability and learn how to deal with it

Measurement Methodology

Data collection
• OSPF
• Tcpdump
[Figure: a data collector attached via Ethernet to the OSPF cloud]
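As an illustration of what the collector sees, the sketch below decodes the fixed 24-byte OSPFv2 packet header from the payload of a captured IP packet (OSPF is IP protocol 89). It is a minimal sketch, not the tooling used in the measurement: the function and class names are hypothetical, the sample bytes are made up, and OSPFv3 uses a different header that is not handled here.

```python
import socket
import struct
from typing import NamedTuple

OSPF_IP_PROTOCOL = 89  # IP protocol number assigned to OSPF
# version, type, packet length, router ID, area ID, checksum, AuType, authentication
OSPF_HEADER = struct.Struct("!BBH4s4sHH8s")

PACKET_TYPES = {
    1: "Hello",
    2: "Database Description",
    3: "Link State Request",
    4: "Link State Update",
    5: "Link State Acknowledgment",
}


class OspfHeader(NamedTuple):
    version: int
    packet_type: str
    length: int
    router_id: str
    area_id: str


def parse_ospf_header(payload: bytes) -> OspfHeader:
    """Decode the fixed OSPFv2 header at the start of an IP payload."""
    if len(payload) < OSPF_HEADER.size:
        raise ValueError("payload shorter than an OSPFv2 header")
    version, ptype, length, rid, aid, _cksum, _autype, _auth = (
        OSPF_HEADER.unpack_from(payload)
    )
    return OspfHeader(
        version,
        PACKET_TYPES.get(ptype, f"unknown({ptype})"),
        length,
        socket.inet_ntoa(rid),
        socket.inet_ntoa(aid),
    )


# Hand-built Hello header with made-up values, standing in for captured data.
sample = OSPF_HEADER.pack(
    2, 1, 44,
    socket.inet_aton("192.0.2.1"),
    socket.inet_aton("0.0.0.0"),
    0, 0, b"\x00" * 8,
)
header = parse_ospf_header(sample)
```

Link State Update packets (type 4) carry the LSAs whose changes are counted in the analysis.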

Measurement Methodology (Cont'd)

Data analysis
• Counting routing changes
  - Changes in the content of an LSA
  - LSA flush
  - Changes in AS-External LSAs were excluded
• Refreshing LSAs were not counted
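The counting rules above can be sketched as follows. This is one possible reading of the rules, not the analysis code used in the study: the `LsaInstance` record is hypothetical, the first sighting of an LSA is treated as a baseline rather than a change, and a flush is recognized by the LSA age reaching MaxAge (3600 seconds in OSPF).

```python
from dataclasses import dataclass

MAX_AGE = 3600        # OSPF MaxAge in seconds; an LSA at MaxAge is being flushed
AS_EXTERNAL_TYPE = 5  # LS type of AS-External-LSAs in OSPFv2


@dataclass(frozen=True)
class LsaInstance:
    ls_type: int
    ls_id: str
    adv_router: str
    age: int
    body: bytes  # LSA contents following the LSA header


def count_routing_changes(instances):
    """Count content changes and flushes; skip refreshes and AS-External-LSAs."""
    last_body = {}
    changes = 0
    for lsa in instances:  # assumed to be in capture order
        if lsa.ls_type == AS_EXTERNAL_TYPE:
            continue  # excluded from the count
        key = (lsa.ls_type, lsa.ls_id, lsa.adv_router)
        if lsa.age >= MAX_AGE:
            if key in last_body:  # flush of an LSA seen before
                changes += 1
                del last_body[key]
            continue
        if key in last_body and last_body[key] != lsa.body:
            changes += 1  # content differs from the previous instance
        last_body[key] = lsa.body  # same body means a refresh: not counted
    return changes


trace = [
    LsaInstance(1, "192.0.2.1", "192.0.2.1", 1, b"links:A,B"),   # baseline
    LsaInstance(1, "192.0.2.1", "192.0.2.1", 1, b"links:A,B"),   # refresh
    LsaInstance(1, "192.0.2.1", "192.0.2.1", 1, b"links:A"),     # content change
    LsaInstance(5, "198.51.100.0", "192.0.2.9", 1, b"ext"),      # excluded
    LsaInstance(1, "192.0.2.1", "192.0.2.1", 3600, b"links:A"),  # flush
]
print(count_routing_changes(trace))  # 2
```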

Case Study 1/2: WIDE Internet

WIDE Internet
• WIDE Project (http://www.wide.ad.jp)
• Connects hundreds of academic organizations
• About 50 routers in the OSPF backbone area
Data collected at NARA-NOC
• Located in Nara, Japan
• Both OSPFv2 and OSPFv3 data collected

Measurement of the WIDE Internet

[Figure: Router-LSA]

Period: August 2000 – May 2004

Measurement of the WIDE Internet (Cont'd)

[Figure panels: Network-LSA, Network-Summary-LSA, ASBR-Summary-LSA]

Period: August 2000 – May 2004

Example of a Typical LSA Oscillation

Relatively frequent changes in the short term
• A router in Fukuoka (WIDE), 5/7/2004, lasted for about 4 hours
Usually caused by congestion

Example of Serious Oscillation

Frequent changes in the short term
• An L3 switch, 6/12/03-6/13/03, lasted for about 18 hours
Observed several times
• Most of them were caused by problems with p2p links or by a misconfiguration in which the same router ID was used on two routers
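One way a monitor might flag the duplicate router ID case is to look for a single router ID appearing in OSPF packets from more than one source address on the monitored segment. The sketch below is only a heuristic under that assumption (a router with several addresses on the same segment would also be flagged), and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def find_duplicate_router_ids(observations):
    """observations: iterable of (source_ip, router_id) pairs from one segment."""
    sources = defaultdict(set)
    for source_ip, router_id in observations:
        sources[router_id].add(source_ip)
    # A router ID announced from several source addresses is suspicious.
    return {rid: sorted(ips) for rid, ips in sources.items() if len(ips) > 1}


observed = [
    ("10.0.0.1", "192.0.2.1"),
    ("10.0.0.2", "192.0.2.2"),
    ("10.0.0.3", "192.0.2.1"),  # same router ID as 10.0.0.1
]
suspects = find_duplicate_router_ids(observed)
```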

Long-Term Changes

Relatively frequent changes
• A router in SF, lasted for 5 months (10/23/03-4/1/04)
Considered to be due to a switch problem

Long-Term Changes (Cont'd)

Slow changes
• A router in Kyoto, has persisted since March 2004
Some of them were caused by interface problems

The Case of OSPFv3

Period: July 2003 – January 2004

Case Study 2/2: APAN Tokyo-XP

APAN Tokyo-XP network
• A transit network located in Tokyo
• Relatively small in scale, with no more than ten routers in the backbone area

Measurement of APAN Tokyo-XP Network (OSPFv2, Router-LSA)

[Figure annotations: ATM link problem, switch problem, misconfiguration]

Period: August 2003 – May 2004

Causes of Instability

Identified causes
• Congestion
  - DDoS
• Link failure
• Software/hardware bug
• Misconfiguration
Most instability is due to causes other than problems in the routing protocols themselves

Analysis Results

Observed routing instability
• Instability observed on both the WIDE Internet and the APAN Tokyo-XP network
• The most typical changes are relatively frequent short-term ones
  - Happen at intervals of 10-200 s
• Frequent short-term changes
• Long-term changes
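To relate a trace to the 10-200 s figure above, one can compute the gaps between successive changes of an LSA and see what share falls in that band. The sketch below assumes change timestamps in seconds; the helper names and sample values are illustrative and not taken from the measurement data.

```python
def change_intervals(timestamps):
    """Return the gaps between successive change timestamps (seconds)."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def share_in_band(intervals, low=10.0, high=200.0):
    """Fraction of intervals that fall within [low, high] seconds."""
    if not intervals:
        return 0.0
    return sum(low <= gap <= high for gap in intervals) / len(intervals)


times = [0, 30, 90, 95, 400, 460]  # made-up change times for one LSA
gaps = change_intervals(times)     # [30, 60, 5, 305, 60]
share = share_in_band(gaps)        # 3 of 5 gaps are within 10-200 s
```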

Analysis Results (Cont'd)

The number of changes is decreasing
• Changes in router implementations
• Less network congestion because of the increased bandwidth in recent years
The causes of many changes are unknown

Rtanaly: A Tool to Detect and Visualize Intra-Domain Routing Instability

Functions
• Detects IGP changes in real time and alerts operators
  - Can also be used for offline data analysis
• Visualization
• Accessible through a WWW interface
Currently only supports OSPF
• IS-IS support will be completed soon
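The slides do not describe how Rtanaly decides when to alert. As a hedged illustration of real-time detection in general, the sketch below raises an alert when the number of routing changes within a sliding time window reaches a threshold. The class name, window length, and threshold are all assumptions chosen for the example.

```python
from collections import deque


class ChangeRateDetector:
    """Alert when too many routing changes fall inside a sliding window."""

    def __init__(self, window_seconds=600.0, threshold=5):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._times = deque()

    def observe(self, timestamp):
        """Record one routing change; return True if an alert should fire."""
        self._times.append(timestamp)
        # Drop changes that are older than the window.
        while self._times and timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.threshold


detector = ChangeRateDetector(window_seconds=60, threshold=3)
alerts = [detector.observe(t) for t in (0, 20, 40, 200, 210)]
# Only the third change (three changes within 60 s) triggers an alert.
```

The same class works for offline analysis by replaying timestamps from a stored capture instead of a live feed.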

Troubleshooting Routing Instability

Why is routing instability troubleshooting difficult?
• Problems occur intermittently, so it is difficult to get useful data for troubleshooting
Event-driven data collection
• Automatically obtain data for troubleshooting when detecting routing changes
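Event-driven collection can be sketched as a handler that runs a collection routine whenever a routing change is detected. The cooldown that keeps an oscillating router from triggering a collection on every change is an assumption added for the example, as are the class and parameter names.

```python
class EventDrivenCollector:
    """Run a collection callback when a routing change is detected."""

    def __init__(self, collect, cooldown_seconds=300.0):
        self.collect = collect  # callable taking a router ID
        self.cooldown_seconds = cooldown_seconds
        self._last_run = {}
        self.snapshots = []

    def on_routing_change(self, router_id, timestamp):
        """Collect data for router_id unless it was collected very recently."""
        last = self._last_run.get(router_id)
        if last is not None and timestamp - last < self.cooldown_seconds:
            return False
        self._last_run[router_id] = timestamp
        self.snapshots.append((timestamp, router_id, self.collect(router_id)))
        return True


collector = EventDrivenCollector(
    lambda rid: f"snapshot of {rid}", cooldown_seconds=60
)
results = [collector.on_routing_change("192.0.2.1", t) for t in (0, 10, 70)]
```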

Troubleshooting Routing Instability (Cont'd)

Data that should be collected
• Traffic volume
• Interface status
• Information on the routing protocols
From where?
• The router that originated the changing LSA
• Network equipment connected to the router
  - Switch
How to collect the data?
• SNMP
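As one possible mapping of the items above onto SNMP, the sketch below builds Net-SNMP `snmpwalk` commands for interface counters and status (IF-MIB) and OSPF neighbor state (OSPF-MIB). The choice of objects is an assumption made for illustration, not the set used by the author, and the host and community string are placeholders.

```python
import subprocess

# Standard MIB columns; which ones to poll is an assumption for this sketch.
OIDS = {
    "traffic_in": "1.3.6.1.2.1.2.2.1.10",           # IF-MIB ifInOctets
    "traffic_out": "1.3.6.1.2.1.2.2.1.16",          # IF-MIB ifOutOctets
    "interface_status": "1.3.6.1.2.1.2.2.1.8",      # IF-MIB ifOperStatus
    "ospf_neighbor_state": "1.3.6.1.2.1.14.10.1.6", # OSPF-MIB ospfNbrState
}


def build_snmpwalk_commands(host, community):
    """Return one snmpwalk command line per data item."""
    return {
        name: ["snmpwalk", "-v", "2c", "-c", community, host, oid]
        for name, oid in OIDS.items()
    }


def collect(host, community, timeout=10):
    """Run the commands; requires Net-SNMP's snmpwalk to be installed."""
    results = {}
    for name, cmd in build_snmpwalk_commands(host, community).items():
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        results[name] = done.stdout
    return results


commands = build_snmpwalk_commands("192.0.2.1", "public")
```

The same commands can be pointed at the switch connected to the router by changing the host argument.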

Conclusions

Routing instability measurements
• Intra-domain routing instability can occur frequently and persistently
• Similar phenomena may occur on other networks
  - It is important to deploy a monitoring system on your own network
Rtanaly
Troubleshooting
• Event-driven data collection

Acknowledgements

My thanks to
• WIDE Project and Nara Institute of Science and Technology
• Operators of APAN Tokyo-XP network
• Prof. Youki Kadobayashi for the idea on troubleshooting
Intra-domain routing stability measurement project
• http://pe0.koganei.wide.ad.jp/rtanaly
Please contact us if you are interested in conducting an IGP measurement on your network
• [email protected]@koganei.wide.ad.jp

Thank you!