This is a discussion on Unable to make outbound connections all of a sudden within the Ingres forums, part of the Database Server Software category.
| Hi folks

We have a live production installation with 250+ clients connected via an OpenROAD 4.1 application; the b/e is AIX 5.2. There have been no h/w changes for 2 years, no upgrade to Ingres (II 2.6/0305 (rs4.us5/00) patch 10384), and no CBF changes or PC changes (NODES, that we've been told about). The o/r application has not changed for some months. We have load balancing with 5 iigcc's starting (one interesting point is that only one of them now appears to be clocking up CPU time in ps -ef output).

Two days ago the following error started appearing in the Ingres errlog:

Y8707E ::[37528 IIGCC, 0000000d]: Wed Feb 7 16:03:18 2007 E_CLFE05_BS_CONNECT_ERR Unable to make outgoing connection.
Y8707E ::[37528 IIGCC, 0000000d]: System communication error: Socket is not connected.

After a restart of Ingres everything works OK for an hour or so.

Another company supports the AIX server; we have Ingres access (but not root). They are saying this is an Ingres problem.

We also have access to errpt, which shows this error:

LABEL: GOENT_LINK_DOWN
IDENTIFIER: DED8E752

Date/Time: Thu 8 Feb 08:03:23 2007
Sequence Number: 515851
Machine Id: 005C90DF4C00
Node Id: Y8707E
Class: H
Type: TEMP
Resource Name: ent1
Resource Class: adapter
Resource Type: 14106902
Location: U0.1-P1-I1/E1
VPD:
Product Specific.( ).......10/100/1000 Base-TX PCI-X Adapter
Part Number.................00P4501
FRU Number..................00P4501
EC Level....................H12511
Manufacture ID..............YL1021
Network Address.............000255535899
ROM Level (alterable).......GOL002

Description
ETHERNET DOWN

Probable Causes
CABLE
CSMA/CD ADAPTER

Failure Causes
LINK TIMEOUT

Recommended Actions
CHECK CABLE AND ITS CONNECTIONS

Detail Data
FILE NAME
line: 191 file: goent_limbo.c
PCI ETHERNET STATISTICS
0000 0005 0063 0853 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0002 0000 0001 0000 0001 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 BB80 00F0 0249 0C00 0000 0000 01A0 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
DEVICE DRIVER INTERNAL STATE
3333 3333 0000 0000 0000 0000
SOURCE ADDRESS
0002 5553 5899

We are being told this is a physical NIC which is not being used and has nothing to do with the problem.

Whilst the fault is present you cannot access IPM (except with -s) or iinamu. Even when you first restart Ingres, although you can start IPM you cannot access the "Server" option.

I've seen a number of old posts on here on the subject, but little in the way of responses, so I assume this is either a silly question, vague, or not a well known subject. Funnily enough, those looking after the server have told us they've requested an engineer to look at the server (probably 24-48 hours away...).

My knowledge of TCP/IP is nil, but from what I've read this looks like a server problem related to TCP sockets. Any suggestions would be most welcome as there are 250 users that can't work.

Steve |
| |||
| On Feb 8, 12:23 pm, "steveg15" <steve....@atosorigin.com> wrote:
> [snip]

Paul

The config.dat parameters have not changed. Ingres is started in batch (by us) as Ingres.

Netstat shows a number of users connected (when any of them can connect), but the bulk of them are using a single port. Out of 194 users, 180 will be against the second port (18065 in our case) and the rest split over the other 4.

One significant factor that's since come to light is that the licence has been out of date since 01/01/07 (this is not supported by us). The CA site indicates there is some problem, but what this results in is somewhat vague: http://supportconnectw.ca.com/public/ca_common_docs/lic_expire.asp. From what's said there it looks to me like CA do not know the implications, but I think this needs to be fixed a.s.a.p.

There is an engineer looking at the AIX server as I write....

Steve |
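The per-port imbalance described above (180 of 194 sessions on port 18065) can be tallied straight from netstat output rather than counted by eye. The following is only a sketch: count_by_port is a hypothetical helper name, and it assumes the local address is the fourth column of `netstat -an` with the port after the last "." (AIX style) or ":" (Linux style).

```shell
# Hypothetical helper: count ESTABLISHED TCP connections per local port.
# Reads netstat -an style lines on stdin and prints "port count" pairs,
# busiest port first.
count_by_port() {
    awk '$NF == "ESTABLISHED" {
             # The local address is assumed to be field 4; the port is
             # whatever follows the last "." or ":" in that field.
             n = split($4, part, /[.:]/)
             count[part[n]]++
         }
         END { for (p in count) print p, count[p] }' |
    sort -k2,2nr -k1,1n
}
```

Typical use would be `netstat -an | count_by_port`. The output includes every local port with established sessions, so the five iigcc listen ports still have to be picked out of it by hand.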
| |||
| At 4:23 AM -0800 2/8/07, steveg15 wrote:
>[snip]
>Whilst the fault is present you cannot access IPM (except with -s) or
>iinamu. Even when you first re-start Ingres, although you can start
>IPM you cannot access the "Server" option?

Sounds like the Name Server is getting hosed somehow. Is it running? Can you connect to the DBMS server directly? Find out what the port number for the DBMS server is (by looking in errlog.log if necessary), then run the following locally on the AIX box:

    export II_DBMS_SERVER=portnumber
    sql some-database

If you can connect to the DBMS server directly, it sounds like a stuck name server. It's not immediately clear to me why the name server would be hanging, but it is odd that this started all of a sudden. Am I right in thinking that existing connections are OK, and it's just that no new ones can be made when this happens?

It does sound as if something changed on at least some of the clients, such that you're burying one of the comm servers instead of using them all. I don't see how that would be related to the hang. (GCC/GCN is not my area of expertise though.)

Karl |
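The direct-connection check suggested above can be wrapped so that II_DBMS_SERVER does not stay set in the login shell afterwards. This is a rough sketch, not a tested Ingres procedure: direct_connect_test and SQL_CMD are hypothetical names, and it assumes the terminal monitor exits on its own when it is given empty input.

```shell
# Hypothetical helper: try a local connection straight to one DBMS server,
# bypassing the name server lookup.
# $1: DBMS server port/listen address (e.g. as found in errlog.log)
# $2: database name
# SQL_CMD defaults to the Ingres terminal monitor "sql"; it is a variable
# only so that the helper can be dry-run with a stand-in command.
direct_connect_test() {
    port="$1"
    db="$2"
    (
        # The subshell keeps II_DBMS_SERVER out of the calling shell.
        II_DBMS_SERVER="$port"
        export II_DBMS_SERVER
        # Empty stdin: assumed to make the monitor quit after connecting.
        ${SQL_CMD:-sql} "$db" </dev/null
    )
}
```

It would be called as `direct_connect_test portnumber some-database`; a non-zero exit status or a connection error from the monitor would suggest the DBMS server itself is unreachable, not just the name server.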
| |||
| Steve said:
> We has a live production installation with 250+ clients connected via
> an openroad 4.1 application, the b/e is AIX 5.2 .
<snip>
> We have load balancing with 5 iigcc's starting (one interesting point is
> the fact we now only appear to have one of them clocking up CPU time
> when looking at a ps -ef output).

Steve, what does the client netutil look like? Do you have merged vnode entries on each client (II0, II1, II2, etc.)? Is each client netutil configured separately, or do you use some kind of funky, shared network client installation? I know you say that nothing has changed, but if it is shared, one user changing the netutil installation might have affected all clients.

If you do have merged netutil entries, can you do a test connection to each gcc within netutil from a remote client?

Another thing ... this is a long shot, but are you by any chance using installation passwords? If so, check the following file: $II_SYSTEM/ingres/files/name/IILTICKET_*. Does it seem really large and growing? Is the iigcn process consuming an inordinate amount of CPU? If so, do this:

    ingstop -iigcn
    rm $II_SYSTEM/ingres/files/name/IILTICKET_*
    ingstart -iigcn

Then get back in touch with me. There is a known Ingres bug that can bring the name server to a screeching halt when you have frequent connections and use installation passwords: used installation password "tickets" aren't correctly cleaned up, causing the IILTICKET file to grow forever. When it reaches "critical mass" (in my case, that was a couple of hundred MB), iigcn CPU use soars and the gcn can no longer respond adequately to remote connection requests. I wouldn't think that 250 users would be enough to cause you problems (actually, I wouldn't even think that you would need five iigcc's), but anything is possible.

These are the only things that come to mind. If you are desperate and really adventurous, you can activate GCN tracing, which might shed some additional light on what's going on:

    export II_GCA_LOG="mytracefile.out"
    export II_GCN_TRACE=5
    ingstop -iigcn
    ingstart -iigcn

Good luck!

Jim Gramling
Rio de Janeiro |
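Before stopping the name server and removing anything, the "really large and growing" question about the IILTICKET files can be answered with a quick size check. A minimal sketch follows; large_ticket_files is a hypothetical helper, and the 102400 KB (100 MB) default is an arbitrary assumption loosely based on the "couple of hundred MB" figure mentioned above.

```shell
# Hypothetical helper: list IILTICKET_* files larger than a size limit.
# $1: directory to scan (the post above gives $II_SYSTEM/ingres/files/name)
# $2: limit in kilobytes (assumed default: 102400, i.e. 100 MB)
large_ticket_files() {
    dir="$1"
    limit_kb="${2:-102400}"
    for f in "$dir"/IILTICKET_*; do
        # With no matching files the glob stays literal, so skip it.
        [ -f "$f" ] || continue
        # wc -c gives the size in bytes; convert to whole kilobytes.
        size_kb=$(( $(wc -c < "$f") / 1024 ))
        if [ "$size_kb" -gt "$limit_kb" ]; then
            echo "$f ${size_kb}KB"
        fi
    done
}
```

Run as `large_ticket_files "$II_SYSTEM/ingres/files/name"`; running it a few minutes apart would also show whether a file is still growing. It only reports sizes and deletes nothing.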
| ||||
| Sorry ... I just noticed that, in your post, you said "unable to make outbound connections", so I misunderstood your post: is your problem on the client machine? What kind of machine is it? Is it running AIX also? Have you tried rebooting your client machine?

It could still be a ticket issue on the server side if you are using installation passwords. Try bypassing the name server as Karl suggested; also, take a look at "show comsvr *" in iinamu to see if all five gcc's are actually registered.

Regards,
Jim Gramling
Rio de Janeiro

On Feb 8, 8:13 pm, "Jim Gramling" <jimwgraml...@gmail.com> wrote:
> [snip] |
|
|