vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Brauer and group, I may be way off, because I have not scanned this code fragment, the callers, etc.. However, my first assumption is that why would you have a p/peer struct hanging around in a STATE_NONE state? My assumption is if you see a hello packet from a nbr, you first verify that it is not from an existing nbr. Then, you alloc the peer struct, init it, and at the same time, you increment the count of the number of nbrs/peers within the AS. The incrementation should be done inside the alloc function, because the struct probably needs to be alloc'ed (you could have static structs), and mem allocs can fail. If diagnostics is set within the app (yes, you should be able to start it up with a diag arg), or... A global count like that is almost useless. It only use would be for a immediate informational message, that you have alloc'ed a new nbr/peer and possibly identify it via its IP address, and the interface that it is connected on.. Mitchell Erblich Sr Software Engineer -------------------- Henning Brauer wrote: > > I yesterday committed fixes for a memory corruption issue in both ntpd > and bgpd. This bug was very interesting for several reasons, and I > want to share how it came to that bug and how we found it, both for the > sake of the story and educational purposes. > > It all started mid-August with an eMail from Arvid Grotting to misc@, > which contained a very strange error from bgpd (it exits on those): > Aug 11 22:27:38 border bgpd[20595]: imsg_get: imsg hdr len out of bounds > we never saw that before, and I cooked him a diff with more disgnostics > in the error message. As this problem was only reproduceable by Arvid > on a production router, it always took a while until he could try diffs > on that one, of course. > After Arvid proved me wrong stating hardware issues by swapping > machines, and running diffs with much better diagnostic messages it was > clear something was going horribly wrong, the RDE got messages from the > SE with the length field set to invalid values - in all messages we saw > it was 1, while the header alone is at least 11 bytes. claudio and > myself read the imsg- and buffer-code again and again and again, trying > to find an off-by-one in memory address calculation and such, to no > avail. The better diagnostic messages then later showed a message type > arriving in the RDE, send by the SE, that is only ever send by the > parent process to the SE... I predicted memory corruption, and we were > searching for buffer management errors in the RDE for days, to no avail. > > Yesterday, otto ran into an ntpd issue. It would not query one of the > configured servers and block on recvfrom(). We tried using recvfrom() > with MSG_DONTWAIT with fixed the blocking issue. But otto fortunately > noticed that one server in the list was not queried at all, and found > out that it made a difference for the bug to show up wether and how > many listen-addresses were configured, and inserting one more server > statement also made a difference. Further debugging didn't yield to new > perceptions. > > On a mountain bike ride a few hours later I suddenly realized that > those bugs might be related. The common factor was a change in the > number of peers at runtime - arvid was using cloning, and otto the > "servers" keyword, which causes late expansion, long after forking and > such. > > With that idea in mind I came back asking more developers for their > eyes to help us finding the bug. It was millert@ who first had the idea > that the counters might be off. Otto, who had the ntpd issue easily > reproducable, verified that the variable i in ntp.c, ntp_main(), grew > bigger than pfd_elms! This meant we were writing to beyond the end of the > pfd array - and that was obviously our memory corruption. > > It didn't take me more than a few minutes to spot that peer_cnt was the > one getting off - when adding the code to expand the "servers" late, > after forking, I forgot to increase peer_cnt (and decrease it when we > delete the head struct for that servers pool, for that matter). > > I went checking bgpd. There, in the main loop in session.c, > session_main(), peer_cnt is properly increased when new peers show up. > New peers have their state set to STATE_NONE, so one of the first > things happening in that loop was > /* new peer that needs init? */ > if (p->state == STATE_NONE) { > init_peer(p); > peer_cnt++; > } > so all was fine in bgpd... apparently. There was one code path which > had to take a shortcut. That is when an OPEN message comes in from an > IP for which we don't have an exact match in our peer list, but that > matches a template (e. g. "neighbor 10.0.0.0/24"). We need to clone > that, and the newly cloned peer has to be used immediately, before the > main loop is traversed again to process the waiting OPEN message - so we > needed to initialize this peer immediately by calling init_peer(). so the > new peer was not in STATE_NONE any more, and the code in the mainloop > did not account for peer_cnt. > > same bug, but completely different ways to it... > > Just adding the missing peer_cnt++s at those two locations in ntpd and > bgpd was not satisfying of course, that only asked for the same failure > to be made again later. > > In ntpd, I added two new functions, peer_add() and peer_remove(), and > those are used everywhere where a peer is added or removed now instead > of doing it by hand. Those now care for keeping peer_cnt in sync with > reality. > > In bgpd, things are a bit more complicated. The idea was that both new > and to-be-deleted peers are just marked as such and the mainloop takes > care of everything else. The most logical way to resolve the issue for > now was letting init_peer() take care of increasing peer_cnt, > decrementing it is still handled in the mainloop. I am not 100% happy > with that slightly asymmetric handling, but more intrusive changes are > a no-no that close to release. > > So, what do we learn from this? > > First, when maintaining counters for list/queue/... entries, don't fuck > with either the counter or the list directly anywhere; use wrapper > functions that take care for both (not using a counter/list pair is > not an option in many, including these two, cases). Not that this is > really news, but very well worth recalling... > Well... bgpd mostly followed that model - and in one instance we were > mucking with the list directly nontheless. > > Second, this issue perfectly shows how important a team of good > developers is. Without otto spotting the ntpd issue, and _not_ settling > with the MSG_DONTWAIT "solution", we had not found the issue. > Without claudio and me searching for Arvid's bug (many thanks to Arvid > for the great debugging help, that was a very important factor as > well!) so intensively that we could be pretty sure that it is a memory > corruption bug (well, we thought it was in the RDE on the receiving > side, but in fact the messages got already corrupted in the SE on the > sending side), I could never had spotted the relation between the bugs. > Then, obviously, mountainbiking is good to find bugs. > Without Todd Miller raising the idea of the counters getting off and > Otto verifying that, we wouldn't have found it either. > And, last not least, I would never have committed the fixes in time > without Otto, Millert and Claudio, who spend some time in the code > while hunting (claudio obviously did before already), and canacar > checking my proposed fixes intensively. > > I can't say how happy and relieved I feel that we found this bug. > Claudio and myself were becoming pretty desperate of the bgpd issue. |