tracker issue : CF-4202859

select a category, or use search below
(searches all categories and all time range)
Title:

ColdFusion uses unsynchronized WeakHashMap in Remote Method Invocation during cache replication. This occasionally leads to infinite looping, hence 100% CPU usage.

| View in Tracker

Status/Resolution/Reason: To Fix//Investigate

Reporter/Name(from Bugbase): A. Bakia / ()

Created: 06/13/2018

Components: Caching, General, Performance

Versions: 11.0

Failure Type: Crash

Found In Build/Fixed In Build: 11,0,14,307976 /

Priority/Frequency: Normal / All users will encounter

Locale/System: English / Win 2012 Server x64

Vote Count: 0

Problem Description: 
Our system is made up of 3 Windows servers. Each server runs its own ColdFusion 11 application server, comprising 13 ColdFusion instances. Our sites get thousands of unique visitors daily. A load-balancer manages the load between the three servers. 

We also use ColdFusion's in-built Ehcache for cache management between the ColdFusion instances. ColdFusion implements Ehcache by internally using Java Remote Method Invocation (RMI) for the point-to-point messaging between the cache nodes. 
 
Occasionally (every other week), the RMI threads in some of the instances loop indefinitely. The result is invariably 100% CPU usage, leading to a server crash. The only solutuion is then to restart the affected instances. this can be 1 instance, two or more. 

When we do a thread-dump analysis during each high-CPU incident, we find that the cause is always RMI TCP Connection threads. 

Steps to Reproduce:
1) Setup: ColdFusion 11 Enterprise on 3 Windows servers, each running at least 3 ColdFusion instances. 
2) Use a load-balancer between the servers and configure the instances for Ehcache Replicated Caching using RMI. That is, the instances may share cached data. 
3) Test the system with high load (thoudsands of unique users per day), on an application that involves the reading, updating, storing and deleting of large amounts of cache data.  

Actual Result:
The server's CPU usage will occasionally rise to 100%. The cause is apparently the RMI TCP Connection threads involved in cache replication. See stack traces attached.

The RMI TCP Connection threads invoke the method java.util.WeakHashMap.put(WeakHashMap.java:453), which sooner or later results in an infinite loop. We noticed that the looping threads always originate from at least two different servers. See the thread-dump analysis attached (It was performed at http://fastthread.io/).

In fact, this same issue has been identified and confirmed elsewhere. Examples are:

http://adambien.blog/roller/abien/entry/endless_loops_in_unsychronized_weakhashmap
https://access.redhat.com/solutions/55161
https://issues.jenkins-ci.org/browse/JENKINS-47725
https://issues.jenkins-ci.org/browse/JENKINS-41797
https://jira.pentaho.com/browse/PDI-14882

Expected Result:
Ehcache Replicated Caching using RMI does not result in infinite calls to java.util.WeakHashMap.put(WeakHashMap.java:453) and to 100% CPU usage.

Any Workarounds:
None known. For the time being, we just restart the affected instances.

Nevertheless, I would like to share with the ColdFusion Team 2 main ideas for a solution, which I came across in my research: 
(1) replace the WeakHashMap.put() call either with a synchronized Java Collections alternative;
or
(2)  wrap object values within WeakReferences before inserting, as in: myMap.put(key, new WeakReference(value)), and then unwrapping upon each get() call.

Further references:
https://github.com/jenkinsci/workflow-api-plugin/compare/1d1c6833fe1a...76f501eb6880
https://github.com/jenkinsci/workflow-api-plugin/commit/76f501eb688095d032bc5b9e4c9a55d0aa6b4bdd
https://github.com/jenkinsci/script-security-plugin/compare/7d56ac842117...598d7ef3040b
https://github.com/pentaho/pdi-osgi-bridge/pull/38

Attachments:

Comments:

Please read "solution" for "solutuion"
Comment by A. Bakia
29031 | June 13, 2018 03:30:19 PM GMT
Related bug reports: https://tracker.adobe.com/#/view/cf-4099820 https://tracker.adobe.com/#/view/CF-3295469
Comment by A. Bakia
29111 | June 20, 2018 02:57:03 PM GMT
A Bakia, don't see an entry point for CFML code in the stack trace you've shared. Is that something you can drill down to in the thread dump or CF logs or in you CFM application perhaps. That could help with simulating the CPU spike at my end. One of the bugs you've mentioned was fixed and released with update 11 of CF11.
29133 | June 25, 2018 10:54:01 AM GMT
Hi Piyush Kumar, Thanks for your reply. To start with, the top of the stacktrace is: java.lang.Thread.State: RUNNABLE at java.util.WeakHashMap.put(WeakHashMap.java:453) at coldfusion.runtime.StructWrapper.readResolve(StructWrapper.java:52) ... The call that is looping indefinitely is java.util.WeakHashMap.put().
Comment by A. Bakia
29136 | June 26, 2018 03:33:52 PM GMT
To respond to the NeedMoreInfo status of this report: it is apparent that the call java.util.WeakHashMap.put() originates from coldfusion.runtime.StructWrapper. Please let me know if you require more information.
Comment by A. Bakia
29367 | July 22, 2018 04:07:44 PM GMT