We own a service supplying not-real-time data to the downstream clients. In order to reduce the average latency and the pressure on database, Memcached was imported as the cluster center cache, and our service consumed it. I encountered an issue when Memcached was firstly deployed on production environment:
[WARN] (Acceptor ServerSocket[addr=/0.0.0.0,port=0,localport=3961]) amazon.platform.servertoolkit.http.jetty.JettyServerManager: No thread for Socket[addr=/10.0.77.98,port=10545,localport=3961] : [Acceptor ServerSocket[addr=/0.0.0.0,port=0,localport=3961]]org.mortbay.util.ThreadPool.run(ThreadPool.java:350)
The server totally gave no response. So I checked the connection number:
netstat -an | grep ESTABLISHED | wc –l 54
And process number:
ps -ef|wc -l 564
The both seemed normal. One teammate remaindered me I should have a look at the threads of Jetty:
ps uH p 31636 | wc -l
I couldn’t find anything abnormal. I knew I might increase the thread up limit but that would not have any help to resolve the issue, at most it might only delay the issue’s happening.
Actually, this issue was a little more complex than we thought indeed.
The threads were running out from this morning, and client could not connect to our service. I took a snapshot (jstack -F -l 31636) for the stacks in JVM, finding that almost all the threads were hung reading from cache server:
Thread 19795: (state = IN_NATIVE) - sun.nio.ch.FileDispatcher.read0(java.io.FileDescriptor, long, int) @bci=0 (Compiled frame; information may be imprecise) - sun.nio.ch.SocketDispatcher.read(java.io.FileDescriptor, long, int) @bci=4, line=21 (Compiled frame) - sun.nio.ch.IOUtil.readIntoNativeBuffer(java.io.FileDescriptor, java.nio.ByteBuffer, long, sun.nio.ch.NativeDispatcher, java.lang.Object) @bci=116, line=198 (Compiled frame) - sun.nio.ch.IOUtil.read(java.io.FileDescriptor, java.nio.ByteBuffer, long, sun.nio.ch.NativeDispatcher, java.lang.Object) @bci=31, line=166 (Compiled frame) - sun.nio.ch.SocketChannelImpl.read(java.nio.ByteBuffer) @bci=101, line=245 (Compiled frame) - com.schooner.MemCached.SockInputStream.readFromChannel() @bci=25 (Interpreted frame) - com.schooner.MemCached.SockInputStream.read() @bci=31 (Compiled frame) - com.schooner.MemCached.SockInputStream.getLine() @bci=9 (Compiled frame) - com.schooner.MemCached.AscIIClient.get(java.lang.String, java.lang.String, java.lang.Integer, boolean) @bci=583 (Compiled frame) - com.schooner.MemCached.AscIIClient.get(java.lang.String, java.lang.Integer) @bci=6 (Compiled frame) - com.schooner.MemCached.AscIIClient.get(java.lang.String) @bci=3 (Compiled frame) - com.danga.MemCached.MemCachedClient.get(java.lang.String) @bci=5 (Compiled frame) - amazon.forecast.util.cache.memcached.BSFMemcachedClient.get(java.lang.String, java.lang.String) @bci=14, line=80 (Compiled frame) - amazon.forecast.distribution.Forecaster.getForecastDistributions(amazon.forecast.distribution.IForecastDistributionProvider, amazon.forecast.distribution.ForecastRequest) ...
On my understanding there should be some error on the connection from the client to cache server, then I checked the cache server but found it’s correctly and stably running, and of course, the network was good. We invoked Memcached service manually several times thru telnet, it did work normally. I took several snapshots in series again and got the same result — it seemed all the threads were being hung on the “read0” method.
However, after restarting the client the issue disappeared. By then nothing could explain why all the threads were being hung reading from cache server. One of my teammate told me if the tcp connection was dropped and the client was trying to read, it would not fail until the retransmission time out. The retransmission would happen in a back-off manner, and it would take long time for “read” to fail. Besides, It didn’t matter whether or not we were able to connect to the cache server. The exact tcp connection was broken on client, and it didn’t detect and was still using the broken socket to read.
Finally, we were still unclear about the root cause, but found a loop discussing about this issue:
A working solution is to set socket timeout to release the connection in time if unforeseen issue happens.