We recently encountered a severe (but rare) issue on one of our servers: upon startup, it failed to handle any request. It took a couple of months to nail it down. Inspecting the stack trace suggested an issue with a lib called proj4j:
One of my colleagues found the following code that looks related:
But… it looks pretty clear that datum was added to the set here…
Take a minute and see if you can spot the issue in the code above.
The root of the issue is the tree. Or, more accurately, the TreeSet, which is not thread-safe. You can see that supportedParams is a member of the class, and if it is initialized by multiple threads at the same time, the state of the TreeSet can be corrupted, for example a missing “datum” String, as in our case.
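The original code is not reproduced here, so the following is only a hypothetical sketch of the pattern described above: a shared TreeSet that is populated lazily on first use, with no synchronization. The class name, method name, and the parameter strings other than "datum" are assumptions made for illustration, not the actual proj4j source.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class; it only illustrates the pattern, not proj4j's real code.
class UnsafeParamRegistry {
    // Shared by every thread and filled in lazily on the first call.
    private static Set<String> supportedParams = null;

    // NOT thread-safe. The set is published (assigned) before it is fully
    // populated, so a second thread can see a non-null but half-filled set,
    // or two threads can mutate the same TreeSet at once and corrupt it.
    static boolean isSupported(String name) {
        if (supportedParams == null) {
            supportedParams = new TreeSet<>();
            supportedParams.add("datum");
            supportedParams.add("ellps"); // illustrative entry
            supportedParams.add("units"); // illustrative entry
        }
        return supportedParams.contains(name);
    }
}
```

On a single thread this behaves as expected, which is part of why such a bug can stay hidden: it only shows up when several requests hit the first call at the same moment, for instance right after startup.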
There are various ways to solve the issue, but first, let’s see whether it’s already fixed.
Apparently yes, 8 years ago, in this commit. When we dug a little deeper, it turned out we were using our own bogus clone of the lib and not the official distribution. There are two “official” forks, both of them with the fix:
- org.osgeo, which doesn’t look actively maintained, but the bug was fixed.
- LocationTech (the maintainers of JTS), which looks more actively maintained.
We tried the second one, but failed to migrate to it due to this issue:
So eventually we decided to use the osgeo port of proj4j. Migration was not entirely smooth due to the above bug and API changes, but luckily for us, there was not a lot of code to migrate on our side.
Afterthoughts
- I am not sure if you noticed, but the fix was adding a synchronized keyword on the method. I can think of better alternatives. I would probably initialize this set once, when the class loads.
- Having an 8-year-old, undocumented lib is also something you should try to avoid.
- Concurrency bugs can be tricky!
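The two options from the first afterthought can be sketched roughly as follows. Both classes are hypothetical and reuse illustrative parameter names. The first mirrors the kind of fix described (a synchronized method around the lazy initialization). The second builds the set in a static initializer, relying on the JVM's guarantee that class initialization runs exactly once, and wraps it as unmodifiable so that no later caller can change it.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Option 1 (hypothetical): keep lazy initialization, but guard it with a lock.
class SynchronizedParamRegistry {
    private static Set<String> supportedParams = null;

    // Only one thread at a time can run this method, so the set is never
    // observed half-filled. Every call pays for acquiring the lock.
    static synchronized boolean isSupported(String name) {
        if (supportedParams == null) {
            supportedParams = new TreeSet<>();
            supportedParams.add("datum");
            supportedParams.add("ellps"); // illustrative entry
            supportedParams.add("units"); // illustrative entry
        }
        return supportedParams.contains(name);
    }
}

// Option 2 (hypothetical): build the set once, when the class is initialized.
class EagerParamRegistry {
    // The JVM runs static initializers once, under its own lock, before any
    // thread can use the class, so no explicit synchronization is needed.
    private static final Set<String> SUPPORTED_PARAMS = buildSupportedParams();

    private static Set<String> buildSupportedParams() {
        Set<String> params = new TreeSet<>();
        params.add("datum");
        params.add("ellps"); // illustrative entry
        params.add("units"); // illustrative entry
        // Read-only view: the set is never mutated after construction.
        return Collections.unmodifiableSet(params);
    }

    static boolean isSupported(String name) {
        return SUPPORTED_PARAMS.contains(name);
    }
}
```

The second variant avoids locking on every call and makes the "written once, read many times" intent explicit, which is why I would lean toward it.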