Diagnosing “random” connection resets in 11g

This was a pretty weird problem I dealt with over the past few days. We migrated a database system from 10g to 11g a while back and almost everything worked just fine. Of course, we also rolled out new clients to the app servers, and those pretty much worked too. But occasionally, servers would get a “Connection reset” exception when trying to connect. All the information we had on this issue was the stack trace from the driver, which really does not tell you a whole lot:

Caused by: java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at oracle.net.ns.DataPacket.send(DataPacket.java:219)
at oracle.net.ns.NetOutputStream.flush(NetOutputStream.java:208)
at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:224)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:172)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:97)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:82)

So eventually this landed on my desk. We checked the sqlnet log files and settings, but things looked good there. We also checked network settings and statistics in the OS, but things looked good there as well. I would have liked to blame this on the networking guys, but all systems are on the same subnet, so this could not be an issue with a firewall or router.

So in all my desperation I asked the mighty Google machine, which came up with this OTN forum thread suggesting that the problem could be related to how random number generation is implemented on Linux systems.

At some point during connection establishment, the driver requires some random numbers, which by default come from /dev/random on Linux systems. But this pseudo-device blocks when there is not enough entropy in the system to ensure “real randomness”. Entropy is generated by mouse and keyboard input as well as some network drivers. It looks like our troubled machines were at times not generating entropy as fast as they needed it, and this caused the JDBC driver to fail to connect. Anyway, the suggested workaround of setting the default source of randomness to the non-blocking /dev/urandom helped.
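
To check what a given JVM is actually configured to use, something like the following minimal sketch may help. It is not taken from the forum thread or the driver; the class name EntropyCheck and its method names are made up for illustration. It relies on the standard java.security.egd system property and the securerandom.source entry in the JVM's java.security file, plus the Linux-only /proc/sys/kernel/random/entropy_avail file. Whether the timing call really blocks depends on the JDK version and the SecureRandom provider in use, so treat the number as a hint only.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.SecureRandom;
import java.security.Security;

// Hypothetical helper for illustration; not part of the Oracle JDBC driver.
class EntropyCheck {

    // Entropy source the JVM is configured with: the java.security.egd system
    // property takes precedence over securerandom.source from java.security.
    static String configuredSource() {
        String fromSystemProperty = System.getProperty("java.security.egd");
        if (fromSystemProperty != null && !fromSystemProperty.isEmpty()) {
            return fromSystemProperty;
        }
        String fromSecurityFile = Security.getProperty("securerandom.source");
        return fromSecurityFile != null ? fromSecurityFile : "(not set)";
    }

    // Kernel's current entropy estimate in bits, or -1 if it cannot be read
    // (for example on a non-Linux system).
    static int availableEntropyBits() {
        Path estimate = Paths.get("/proc/sys/kernel/random/entropy_avail");
        try {
            byte[] raw = Files.readAllBytes(estimate);
            return Integer.parseInt(new String(raw, StandardCharsets.US_ASCII).trim());
        } catch (Exception e) {
            return -1;
        }
    }

    // Milliseconds needed to create a SecureRandom and draw numBytes from it.
    static long timeRandomBytesMillis(int numBytes) {
        long start = System.nanoTime();
        new SecureRandom().nextBytes(new byte[numBytes]);
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        System.out.println("entropy source : " + configuredSource());
        System.out.println("entropy_avail  : " + availableEntropyBits());
        System.out.println("16 random bytes: " + timeRandomBytesMillis(16) + " ms");
    }
}
```

Running it once as-is and once with -Djava.security.egd=file:/dev/./urandom on the command line should show the configured source change. That odd-looking /dev/./urandom spelling is the form commonly recommended for older JDKs; I am assuming here that it matches what the forum thread suggests.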

There is also a rather well-hidden Metalink note on this, which in turn links to this really good blog article.

I remember this biting me in the ass about 10 years ago, when an SSL-enabled Apache web server refused to start under certain conditions. Our first workaround was to send someone to the box to move the mouse or type stuff on the keyboard…

Speaking at UKOUG in Birmingham

This year’s event calendar is quickly filling up, with my presentation “Setting up RAC for planned downtime” being accepted at the UK user group conference in December. I have not been to this conference before and am thrilled to finally check out Europe’s largest English-speaking Oracle event. I have heard only good things about it, and I am sure there will be lots of smart, nice and interesting people to meet and exchange ideas with.

This might also be a good chance to get together with other RAC SIG members from the UK and Europe. Let me know if you are interested in setting something up.