Discussion:
[c-nsp] ASR9k: RIB/FIB convergence
Thomas Schmid
2018-08-02 09:13:32 UTC
Permalink
Hi all,

sort of a heads up ...

I'd be interested to hear if, and under which circumstances, others are seeing this behavior,
since the root cause is still unknown.

In the beginning there were some anecdotal complaints
from customers that they experienced persistent reachability problems to some destinations
whenever we did scheduled maintenance elsewhere in our network. Further
investigation pointed to routing inconsistencies during large RIB changes.

To give you some numbers: we found that in our environment, processing 70k BGP changes
takes 2-3 minutes to write the updates to the FIB; 700k routes take 20-30 minutes!!

During that period, RIB and FIB are inconsistent, with all the nasty consequences:
blackholing, routing loops, etc.

Convergence time seems to be somehow related to the number of eBGP sessions on the
box. On routers with fewer than 200 sessions, convergence time looks OK; from 300+
sessions on, things get bad.

This affects both XR 5.3.3 and 6.2.3, on both Typhoon and Tomahawk linecards.

TAC/BU are currently working on this, but they are having a hard time finding out what's
going wrong here. Processing the updates on the RP takes less than 1 s,
but writing the updates to the LC takes forever ...

Thanks,

Thomas
a***@netconsultings.com
2018-08-02 10:23:04 UTC
Permalink
Thomas Schmid
Sent: Thursday, August 02, 2018 10:14 AM
Hi all,
sort of a heads up ...
I'd be interested to hear if, and under which circumstances others are seeing
this behavior, since the root cause is still unknown.
In the beginning there were some anecdotical complaints by customers that
they experienced persistent reachability problems to some destinations
when we did a scheduled maintenance in our network somewhere else.
Further investigations pointed to routing inconsistencies during large RIB
changes.
To give you some numbers: we found out that in our environment
processing 70k BGP changes takes 2-3 min to write the updates to FIB, 700k
routes takes 20-30 min!!
blackholing, routing loops etc.
Convergence time seems to be somehow related to the number of eBGP
sessions on the box. On routers with less than 200 sessions, convergence
time looks ok, from 300+ sessions on, things get bad.
This affects both XR 5.3.3, 6.2.3 and Typhoon, Tomahawk linecards.
TAC/BU are currently working on this, but they have a hard time to find out
what's going wrong here. Processing the updates on the RP takes less than
1s, but writing the updates to the LC takes forever ...
First things first:
To mitigate the damage from RIB-FIB inconsistencies, you could use the
"BGP-RIB Feedback Mechanism for Update Generation"
"To configure BGP to wait for feedback from RIB indicating that the routes that BGP installed in RIB are installed in FIB, before BGP sends out updates to neighbors, use the "update wait-install" command in router address-family IPv4 or router address-family VPNv4 configuration mode."
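For reference, a minimal IOS XR configuration sketch of what that would look like (the AS number is a placeholder; verify the exact syntax against your release's BGP command reference):

```
router bgp 65000
 address-family ipv4 unicast
  ! Wait for RIB feedback that routes are installed in FIB
  ! before BGP advertises them to neighbors
  update wait-install
 !
!
```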


Are you seeing any log messages indicating a bottleneck between RIB and FIB?
Do you drop BGP updates on ingress with "as-path length ge 51"? Not only is it good practice, but apparently long AS-paths have caused RIB-FIB clogging in the past.
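For illustration, a sketch of such an ingress filter in IOS XR RPL (the policy name, neighbor, and AS numbers are made up; your real ingress policy logic would follow the drop):

```
route-policy EBGP-IN
  ! Drop updates carrying absurdly long AS paths
  if as-path length ge 51 then
    drop
  endif
  pass
end-policy
!
router bgp 65000
 neighbor 192.0.2.1
  remote-as 64496
  address-family ipv4 unicast
   route-policy EBGP-IN in
```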

On your note regarding the apparent relation to the number of peers:
How long does the process take to complete on the 200-peer nodes? Is it linearly proportional to the 20-30 minutes seen on the 300-peer nodes?
Or does the relation between the number of peers and convergence time follow more of an exponential function (e.g. 290 all good, then 301 bang, 30 min)? In that case it could also point to something special about those "delta" peers (e.g. some peers sending somewhat funky updates). (Any slow peers, by the way?)


adam

netconsultings.com
::carrier-class solutions for the telecommunications industry::


_______________________________________________
cisco-nsp mailing list cisco-***@puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/
Thomas Schmid
2018-08-02 11:11:40 UTC
Permalink
Hi Adam,
Post by a***@netconsultings.com
First thing first,
"BGP-RIB Feedback Mechanism for Update Generation"
"To configure BGP to wait for feedback from RIB indicating that the routes that BGP installed in RIB are installed in FIB, before BGP sends out updates to neighbors, use the "update wait-install" command in router address-family IPv4 or router address-family VPNv4 configuration mode."
good point. I haven't heard of this from Cisco yet. Will discuss it with TAC.
Post by a***@netconsultings.com
Are you seeing any log messages indicating bottleneck between RIB and FIB please?
no.
Post by a***@netconsultings.com
Do you drop BGP updates on ingress with "as-path length ge 51" please? -not only it's a good practice, but apparently long as-paths caused RIB-FIB clogging in the past.
On your note regarding the apparent relation to number of peers.
So how long does it take for the process to complete for the 200 peers nodes is it linearly proportional to the 20-30 minutes seen on 300 peers nodes please?
Or the relation between number of peers and time follows more of an exponential function (e.g. 290 all good and then 301 bang 30min) , in which case that could also indicate something special with those "delta" peers (e.g. some peers sending somewhat funky updates) (any slow peers btw?)
to be clearer: the full 700k BGP updates are only sent to a small fraction of the e/iBGP peers (10-20).

The BGP updates are sent out without delay to the neighbors. Regarding the number of sessions at which things get bad, it's hard to tell, since we only have 8 routers: 4 ASR9000 and 4 ASR9900 (many BGP peers). Within the 4 ASR9900 it looks more or less linear.

Thanks,

Thomas
Sebastian Neuner
2018-08-02 17:19:25 UTC
Permalink
Hi Thomas,

we have seen similar effects in the past. I remember a case where a
router with Trident cards and 4.3.1 (and newer routers around it) got
stuck in a situation similar to yours. It even tried to forward packets
to a port that was admin-down.
Post by a***@netconsultings.com
Do you drop BGP updates on ingress with "as-path length ge 51" please? -not only it's a good practice, but apparently long as-paths caused RIB-FIB clogging in the past.
This fixed our problem. After a whole night of debugging, I found this
mail thread: "[c-nsp] CEF issues this weekend".

Some AS announced a prefix and prepended >500 times.

Since then, we filter on AS-path length on ingress everywhere and
haven't seen this behavior again.

Best regards,
Sebastian
a***@netconsultings.com
2018-08-02 17:52:24 UTC
Permalink
Sebastian Neuner
Sent: Thursday, August 02, 2018 6:19 PM
To: 'Cisco Network Service Providers'
Subject: Re: [c-nsp] ASR9k: RIB/FIB convergence
Hi Thomas,
we have seen similar effects in the past. I remember a case, where a router with Trident cards and 4.3.1 (and newer routers around it) got stuck in a situation similar to yours. It even tried to forward packets to a port that was admin-down.
Post by a***@netconsultings.com
Do you drop BGP updates on ingress with "as-path length ge 51" please? -not only it's a good practice, but apparently long as-paths caused RIB-FIB clogging in the past.
This fixed our problem. After a whole night of debugging, I found this mail thread, "[c-nsp] CEF issues this weekend".
Some AS announced a prefix and prepended >500 times.
Since then, we filter for as-path-length on ingress everywhere and haven't seen this behavior again.
Yup, I remember that one very well.
It came in fairly quick succession (though I'm not sure which one was first) with the
incident where some university advertised a prefix with a custom BGP
attribute and forgot to tell the world until it was too late.
I guess these two incidents then led to the long and painful road to
RFC 7606 (Revised Error Handling for BGP UPDATE Messages), with varying
success among vendors:
Good: %ROUTING-BGP-3-MALFORM_UPDATE : Malformed UPDATE message received from
neighbor x.x.x.x (VRF: INTERNET) - message length 103 bytes, error flags
0x00400000, action taken "DiscardAttr"
Bad: When the 'bgp-error-tolerance' feature - designed to help mitigate
remote session resets from malformed path attributes - is enabled, a BGP
UPDATE containing a specifically crafted set of transitive attributes can
cause the RPD routing process to crash and restart.

Also, these are the reasons why I always recommend building a separate RR
infrastructure (or plane) dedicated to carrying internet prefixes, and keeping it
separate from the RR infrastructure carrying prefixes for VPN services.

adam

netconsultings.com
::carrier-class solutions for the telecommunications industry::

a***@netconsultings.com
2018-08-02 18:02:47 UTC
Permalink
Sent: Thursday, August 02, 2018 12:12 PM
Hi Adam,
Post by a***@netconsultings.com
First thing first,
"BGP-RIB Feedback Mechanism for Update Generation"
"To configure BGP to wait for feedback from RIB indicating that the routes
that BGP installed in RIB are installed in FIB, before BGP sends out updates to
neighbors, use the "update wait-install" command in router address-family
IPv4 or router address-family VPNv4 configuration mode."
good point. Haven't heard of this from Cisco yet. Will discuss this with the TAC.
Post by a***@netconsultings.com
Are you seeing any log messages indicating bottleneck between RIB and
FIB please?
no.
Post by a***@netconsultings.com
Do you drop BGP updates on ingress with "as-path length ge 51" please? -
not only it's a good practice, but apparently long as-paths caused RIB-FIB
clogging in the past.
Post by a***@netconsultings.com
On your note regarding the apparent relation to number of peers.
So how long does it take for the process to complete for the 200 peers
nodes is it linearly proportional to the 20-30 minutes seen on 300 peers
nodes please?> Or the relation between number of peers and time follows
more of an exponential function (e.g. 290 all good and then 301 bang 30min) ,
in which case that could also indicate something special with those "delta"
peers (e.g. some peers sending somewhat funky updates) (any slow peers
btw?)
to be more clear: the full 700k BGP updates are only sent to a small fraction
of the e/iBGP peers (10-20).
The BGP updates are sent out without delay to the neighbors. Wrt. the
number of sessions when things get bad, it's hard to tell since the number of
routers is 8 with 4 ASR9000 and 4 ASR9900 (many BGP peers). Within the 4
ASR9900 it looks more or less linear.
Seems like the control plane is all good and it's just bcdlv2 having trouble - or GSP as the underlying transport.
"show bcdlv2 trace" should give the TAC engineers tons of info.
Btw, if you do "show rib update-groups", you see no freeze counts, right?


adam

netconsultings.com
::carrier-class solutions for the telecommunications industry::

Thomas Schmid
2018-08-03 06:44:49 UTC
Permalink
Hi Adam,
Post by a***@netconsultings.com
Post by Thomas Schmid
to be more clear: the full 700k BGP updates are only sent to a small fraction
of the e/iBGP peers (10-20).
The BGP updates are sent out without delay to the neighbors. Wrt. the
number of sessions when things get bad, it's hard to tell since the number of
routers is 8 with 4 ASR9000 and 4 ASR9900 (many BGP peers). Within the 4
ASR9900 it looks more or less linear.
Seems like the control plane is all good it's just the bcdlv2 having troubles - or GSP as the underlying transport.
The " show bcdlv2 trace" should give TAC engineers tons of info.
yes, I already had something like 10 hours of WebEx with TAC/BU/DEs, but "show bcdlv2 trace" doesn't give enough info, so they had to build a special debug SMU that can only be used in a lab environment. That's where we are at the moment.
Post by a***@netconsultings.com
Btw if you do "show rib update-groups" you see no freeze counts right?
actually there's one entry:

Update Group   Client ID   Parent UG   Redist    Freeze Count
 3              0           0          outsync   0
 2             10          N/A         insync    0
 0             22          N/A         insync    0
 1             23          N/A         insync    0
28             43           1          outsync   1

Cheers,

Thomas
Thomas Schmid
2018-08-03 09:08:17 UTC
Permalink
Hi,
Post by Thomas Schmid
Hi Adam,
Post by a***@netconsultings.com
First thing first,
"BGP-RIB Feedback Mechanism for Update Generation"
"To configure BGP to wait for feedback from RIB indicating that the routes that BGP installed in RIB are installed in FIB, before BGP sends out updates to neighbors, use the "update wait-install" command in router address-family IPv4 or router address-family VPNv4 configuration mode."
good point. Haven't heard of this from Cisco yet. Will discuss this with the TAC.
giving it a second thought: this may help in some cases, but not in others. E.g. the BGP link to an upstream dies -> the FIB is still pointing to the upstream -> the router keeps announcing itself as the exit point until all FIB entries are updated -> traffic is dropped.

It may help if e.g. the AS-path length is changing. Routing will then be suboptimal for a while, but the rate at which the updates are announced to the neighboring routers should be such that they can sync their FIBs in time.

Waiting for feedback from TAC on whether this command actually checks the LC FIB or just looks at the RP.

Cheers,

Thomas
a***@netconsultings.com
2018-08-03 13:46:09 UTC
Permalink
Thomas Schmid
Sent: Friday, August 03, 2018 10:08 AM
Subject: Re: [c-nsp] ASR9k: RIB/FIB convergence
Hi,
Post by Thomas Schmid
Hi Adam,
Post by a***@netconsultings.com
First thing first,
"BGP-RIB Feedback Mechanism for Update Generation"
"To configure BGP to wait for feedback from RIB indicating that the routes
that BGP installed in RIB are installed in FIB, before BGP sends out updates to
neighbors, use the "update wait-install" command in router address-family
IPv4 or router address-family VPNv4 configuration mode."
Post by Thomas Schmid
good point. Haven't heard of this from Cisco yet. Will discuss this with the
TAC.
giving it a second thought: this may help in some cases, in others not. E.g.
BGP link to upstream dies -> FIB is still pointing to upstream -> router is still
announcing himself as exit point until all FIB entries are updated -> traffic is
dropped.
It may help if e.g. as-path length is changing. The routing will then be
suboptimal for a while, but the rate with which the updates are announced
to the neighboring routers should be such that they can sync their FIB in
time.
Waiting for feedback from TAC if this command is indeed checking the LC FIB
or if it's just looking at the RP.
Hmm,
Very good point, but if this feature is to be of any use, it should be tied only to the "ADD" operation, possibly to the "UPDATE" operation, but certainly not to the "DELETE" operation.
(Not to mention that it should get the state back from the LC's NP HW FIB, or at least the LC's CPU SW FIB.)

adam

netconsultings.com
::carrier-class solutions for the telecommunications industry::


Thomas Schmid
2018-08-10 09:35:19 UTC
Permalink
Hi,
Post by a***@netconsultings.com
Post by Thomas Schmid
giving it a second thought: this may help in some cases, in others not. E.g.
BGP link to upstream dies -> FIB is still pointing to upstream -> router is still
announcing himself as exit point until all FIB entries are updated -> traffic is
dropped.
It may help if e.g. as-path length is changing. The routing will then be
suboptimal for a while, but the rate with which the updates are announced
to the neighboring routers should be such that they can sync their FIB in
time.
Waiting for feedback from TAC if this command is indeed checking the LC FIB
or if it's just looking at the RP.
Hmm,
Very good point, but if this feature should be of any use then it should be associated only with "ADD'' operation, possibly with "UPDATE" operation, but certainly not with "DELETE" operation.
(not mentioning it should get the state back from the LC's NP HW-FIB or at least LC's CPU SW-FIB)
it turns out you can run into funny situations with that 'update wait-install' command enabled:

RTA: 'update wait-install' configured. Learns route a.b.c.0/24 from direct peer ISPA.

RTB: learns a.b.c.0/24 from eBGP peer ISPB with a longer AS-path.

RTA and RTB are in a BGP full mesh.

If I prepend the route for a.b.c.0/24 on RTA, the local BGP table is updated, but
the announcement with the longer AS-path *never* makes it to RTB, probably because
the local CEF entry is not updated and does not change. So RTA holds back the
BGP announcement of the longer route to its neighbors.

RTB therefore never sees the longer AS-path for the prefix and thus *never* announces
the shorter route back to RTA. As a result, the routing never changes in the network.

In addition, 5.3.3 has bug CSCuv02045, "Mutex in ipv4_rib/ipv6_rib when
update-wait-install is enabled" ...

Cheers,

Thomas
Bryan Holloway
2018-08-13 18:39:57 UTC
Permalink
Post by Thomas Schmid
Hi,
Post by a***@netconsultings.com
Post by Thomas Schmid
giving it a second thought: this may help in some cases, in others not. E.g.
BGP link to upstream dies -> FIB is still pointing to upstream -> router is still
announcing himself as exit point until all FIB entries are updated -> traffic is
dropped.
It may help if e.g. as-path length is changing. The routing will then be
suboptimal for a while, but the rate with which the updates are announced
to the neighboring routers should be such that they can sync their FIB in
time.
Waiting for feedback from TAC if this command is indeed checking the LC FIB
or if it's just looking at the RP.
Hmm,
Very good point, but if this feature should be of any use then it should be associated only with "ADD'' operation, possibly with "UPDATE" operation, but certainly not with "DELETE" operation.
(not mentioning it should get the state back from the LC's NP HW-FIB or at least LC's CPU SW-FIB)
RTA: 'update wait-install' configured. Learns route a.b.c.0/24 from direct peer ISPA.
RTB: learns a.b.c.0/24 from eBGP peer ISPB with longer AS-path.
RTA, RTB in BGP full mesh.
If I prepend the route for a.b.c.0/24 on RTA, the local BGP table is updated, but
the announcement with the longer as-path *never* makes it to RTB, probably because
the CEF entry locally is not updated and does not change. So RTA is holding back the
BGP announcement of the longer route to his neighbors.
So RTB never sees the longer as-path for the prefix and therefore *never* announces
the shorter route via back to RTA. Therefore the routing never changes in the network.
In addition: 5.3.3 has bug CSCuv02045 "Mutex in ipv4_rib/ipv6_rib when
update-wait-install is enabled" ...
Cheers,
Thomas
We've also had operational issues enabling "update wait-install", running
5.3.4 SP8 on a 9006 with RSP880s.

Lots of ROUTING-BGP-4-SLOW_FEEDBACK and
ROUTING-BGP-4-FEEDBACK_OUT_OF_ORDER messages, followed by BGP sessions just
dropping and us having to restart the IGP and BGP processes.

Finally we just turned it off. Our convergence times just aren't bad enough
to warrant enabling it.

a***@netconsultings.com
2018-08-14 09:01:08 UTC
Permalink
Thomas Schmid
Sent: Friday, August 10, 2018 10:35 AM
Hi,
Post by a***@netconsultings.com
Post by Thomas Schmid
giving it a second thought: this may help in some cases, in others not. E.g.
BGP link to upstream dies -> FIB is still pointing to upstream ->
router is still announcing himself as exit point until all FIB
entries are updated -> traffic is dropped.
It may help if e.g. as-path length is changing. The routing will then
be suboptimal for a while, but the rate with which the updates are
announced to the neighboring routers should be such that they can
sync their FIB in time.
Waiting for feedback from TAC if this command is indeed checking the
LC FIB or if it's just looking at the RP.
Hmm,
Very good point, but if this feature should be of any use then it should be
associated only with "ADD'' operation, possibly with "UPDATE" operation, but
certainly not with "DELETE" operation.
Post by a***@netconsultings.com
(not mentioning it should get the state back from the LC's NP HW-FIB
or at least LC's CPU SW-FIB)
RTA: 'update wait-install' configured. Learns route a.b.c.0/24 from direct peer ISPA.
RTB: learns a.b.c.0/24 from eBGP peer ISPB with longer AS-path.
RTA, RTB in BGP full mesh.
If I prepend the route for a.b.c.0/24 on RTA, the local BGP table is updated,
but the announcement with the longer as-path *never* makes it to RTB,
probably because the CEF entry locally is not updated and does not change.
So RTA is holding back the BGP announcement of the longer route to his
neighbors.
So RTB never sees the longer as-path for the prefix and therefore *never*
announces the shorter route via back to RTA. Therefore the routing never
changes in the network.
In addition: 5.3.3 has bug CSCuv02045 "Mutex in ipv4_rib/ipv6_rib when
update-wait-install is enabled" ...
Sorry for the late response,

Well, you've got to be careful here.
You haven't stated that, in the initial conditions, A had indeed selected the path via ISPA as the overall best path.
Because if A believed the route via B from ISPB was the overall best path, then A would not advertise its own route to B (unless A is configured with "advertise best-external", which I recommend in order to speed up convergence; note, though, that it increases FIB usage).

If, in the initial conditions, A did select the path via ISPA as the overall best path, then I agree it's a bug and should be reported.
(It would be interesting to see from a debug why the route is not advertised to B.)
Also, I believe the update-wait-install routine should not kick in for this scenario (it should kick in only when a path changes to the best path).

It seems CSCuv02045 was fixed only very recently, ~6.2.2+.
Isn't there a SMU available for older releases?

Oh, and one last note:
An alternative solution is BGP PIC Edge with "advertise best-external". This combo covers all the cases except a new (unforeseen) path being relayed to other speakers, or a reload of the box; for these two cases the only solution is "update-wait-install".
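A rough IOS XR sketch of that combo (the policy name, backup index, and AS number are illustrative; whether "advertise best-external" is available depends on release and address family, so check your command reference):

```
! Pre-install a backup path so the FIB can repair locally (PIC Edge)
route-policy PIC-EDGE-BACKUP
  set path-selection backup 1 install
end-policy
!
router bgp 65000
 address-family ipv4 unicast
  additional-paths selection route-policy PIC-EDGE-BACKUP
  ! Advertise the best external path even when an iBGP path is overall best
  advertise best-external
```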


adam

netconsultings.com
::carrier-class solutions for the telecommunications industry::


Thomas Schmid
2018-08-21 14:29:02 UTC
Permalink
Hi,
Post by a***@netconsultings.com
Thomas Schmid
Sent: Friday, August 10, 2018 10:35 AM
it turns out you run into funny situations with that 'update wait-install'
RTA: 'update wait-install' configured. Learns route a.b.c.0/24 from direct peer ISPA.
RTB: learns a.b.c.0/24 from eBGP peer ISPB with longer AS-path.
RTA, RTB in BGP full mesh.
If I prepend the route for a.b.c.0/24 on RTA, the local BGP table is updated,
but the announcement with the longer as-path *never* makes it to RTB,
probably because the CEF entry locally is not updated and does not change.
So RTA is holding back the BGP announcement of the longer route to his
neighbors.
So RTB never sees the longer as-path for the prefix and therefore *never*
announces the shorter route via back to RTA. Therefore the routing never
changes in the network.
In addition: 5.3.3 has bug CSCuv02045 "Mutex in ipv4_rib/ipv6_rib when
update-wait-install is enabled" ...
Sorry for the late response,
Well you've got to be careful here,
You haven’t stated that, during initial conditions, A believed that path via ISPA has been selected as the overall best path indeed.
Cause if A believed that route via B from ISPB is the overall best path ,then A would not advertise its own route to B (unless A is configured with "advertise best-external" which I recommend in order to speed up convergence -please note though it increases FIB usage).
If, during initial conditions, A did select path via ISPA as the overall best path then I agree it’s a bug and should be reported.
(would be interesting to see from debug why the route is not advertised to B)
Also I believe that update-wait-install routine should not be used in this scenario (should be used only when a path is changed to best path).
according to TAC, the observed behavior is intended: no change in the CEF table -> no BGP update announcement. This leads to a classic deadlock situation. 'advertise best-external' might indeed help, but as you said, FIB usage goes up a lot if you do this for e.g. upstream connections.
Post by a***@netconsultings.com
Seems like CSCuv02045 has been fixed only very recently ~6.2.2+
Isn't there a SMU available for older releases?
no SMU for 5.3.3

Cheers,

Thomas
Gert Doering
2018-08-21 18:59:10 UTC
Permalink
Hi,
Post by Thomas Schmid
according to TAC, the behavior observed is intended behavior. No change in the CEF table -> no BGP update announcement. This leads to a classical deadlock situation. 'advertise best-external' might indeed help, but as you said, FIB usage goes up a lot when you do this for e.g. upstream connections.
Note that advertise best-external is only supported if you do
labeled unicast.

If you run your "Internet" unlabeled, advertise best-external will do
fairly insane things which, according to TAC, "work as designed"
(namely, install both the "best iBGP" and the "best eBGP" path in the
FIB and do load-sharing(!) across them).
gert
--
"If was one thing all people took for granted, was conviction that if you
feed honest figures into a computer, honest figures come out. Never doubted
it myself till I met a computer with a sense of humor."
Robert A. Heinlein, The Moon is a Harsh Mistress

Gert Doering - Munich, Germany ***@greenie.muc.de
Jason Lixfeld
2018-08-02 10:35:09 UTC
Permalink
Hi,

I don’t have that many sessions on any one of my 9Ks, but...
Post by Thomas Schmid
To give you some numbers: we found out that in our environment processing 70k BGP changes
takes 2-3 min to write the updates to FIB, 700k routes takes 20-30 min!!
How are you able to see this in the wild? Simply by looking at the CEF summary on each LC/RP at some interval?
Post by Thomas Schmid
TAC/BU are currently working on this, but they have a hard time to find out what's
going wrong here. Processing the updates on the RP takes less than 1s,
but writing the updates to the LC takes forever ...
Can you clarify? Do you mean it takes less than 1 s for the RP to successfully feed its own FIB after the above topology change (which I assume is the 700k-route topology change), or ...?
Thomas Schmid
2018-08-02 11:21:43 UTC
Permalink
Hi Jason,
Hi,
I don’t have that many sessions on any one of my 9Ks, but...
Post by Thomas Schmid
To give you some numbers: we found out that in our environment processing 70k BGP changes
takes 2-3 min to write the updates to FIB, 700k routes takes 20-30 min!!
How are you able to see this in the wild? Simply look at the CEF sum on each LC/RP at some interval?
compare 'sh route a.b.c.d/n' with 'sh cef a.b.c.d/n'. It looks like the low IP ranges are processed first (TAC confirmed this), so looking at a high IP address shows it best.
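To make the check concrete, the comparison looks something like this (the prefix is a placeholder; during the convergence window a mismatch shows up as differing next hops, or a CEF entry lagging behind the route):

```
! RIB view - what the routing protocols have decided
RP/0/RSP0/CPU0:router# show route 203.0.113.0/24

! FIB view - what the linecards actually forward on
RP/0/RSP0/CPU0:router# show cef 203.0.113.0/24
```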
Post by Thomas Schmid
TAC/BU are currently working on this, but they have a hard time to find out what's
going wrong here. Processing the updates on the RP takes less than 1s,
but writing the updates to the LC takes forever ...
Can you clarify? Do you mean it takes less than 1s for the RP to successfully feed it's own FIB after the above topology change (which I assume to be the 700k route topology change), or ...?
the BU found that the rib process on the RP finishes the updates after ca. 1 s. Then the bcdlv2 process takes the updates and does the bulk download to the LC. Here the queue is served too slowly. They're currently working on finding out what is slowing it down. But this is a moving target and the working hypothesis may change again ...

Regards,

Thomas
Thomas Schmid
2018-08-21 14:35:29 UTC
Permalink
Hi,

to give you an update: TAC could finally reproduce the issue in the lab. RIB/FIB sync is thwarted when there is a VSM module installed in the chassis (which we have in all our 9k chassis).

Let's see if they can fix it with a SMU ...

Cheers,

Thomas
Bryan Holloway
2018-08-21 19:14:22 UTC
Permalink
Now that is a spicy meatball.
Post by Thomas Schmid
Hi,
to give you an update: TAC finally could reproduce the issue in the lab. RIB/FIB sync is thwarted when there's a VSM module installed in the chassis (which we have in all 9k chassis).
Let's see if they can fix it with a SMU ...
Cheers,
Thomas
Thomas Schmid
2018-08-27 14:00:49 UTC
Permalink
Hi,
Post by Thomas Schmid
Hi,
to give you an update: TAC finally could reproduce the issue in the lab. RIB/FIB sync is thwarted when there's a VSM module installed in the chassis (which we have in all 9k chassis).
Let's see if they can fix it with a SMU ...
latest news: FIB updates are sent via the punt switch (a sort of regular Ethernet switch on the RP) as multicast to all LCs. On A99-RP2-SE RPs, this switch has flow control enabled. The VSM LC answers these updates with PAUSE frames, so the switch throttles the update rate towards all LCs down to close to zero.

Good news: as a workaround, you can disable flow control towards the punt switch via the CLI for specific LCs.

I'm reluctant to share the CLI command here because it is very hardware-specific and I don't want anybody to run into problems just because their hardware is slightly different from ours.

Cisco will provide a SMU in the next few weeks that either disables flow control on these LCs or changes the throttle thresholds (not decided yet).

Cheers,

Thomas
a***@netconsultings.com
2018-08-27 14:38:20 UTC
Permalink
Thomas Schmid
Sent: Monday, August 27, 2018 3:01 PM
Subject: Re: [c-nsp] ASR9k: RIB/FIB convergence
Hi,
Post by Thomas Schmid
Hi,
to give you an update: TAC finally could reproduce the issue in the lab.
RIB/FIB sync is thwarted when there's a VSM module installed in the chassis
(which we have in all 9k chassis).
Post by Thomas Schmid
Let's see if they can fix it with a SMU ...
latest news: FIB updates are sent via the punt-switch (sort of regular
ethernet switch on the RP) as multicast to all LCs. On A99-RP2-SE RPs this
switch has flow control enabled. Now, the VSM LC answers to these updates
with PAUSE frames and therefore the switch throttles the update rate to all
LCs close to zero.
Good news: as a workaround you can disable flow-control to the punt-switch
via CLI for specific LCs.
I'm reluctant to share the CLI command here because it's very hardware
specific and I don't want anybody to run into problems just because their
hardware is slightly different from ours.
Cisco will provide a SMU in the next weeks that either disables flow-control
on these LC or changes the throttle thresholds (not decided yet).
Thank you very much for the update.

This begs the question: if flow control is enabled (i.e. the punt switch reacts to PAUSE frames), under what circumstances can this happen with other linecards?
In other words, under which circumstances do various linecards issue PAUSE frames? Or is only the VSM capable of sending PAUSE frames (an artefact of it being a sort of "PC")?


adam

netconsultings.com
::carrier-class solutions for the telecommunications industry::

