Finding the cause of lag-spikes

 

   Figuring out massive lag spikes isn't always as easy as using /timings, which is why I've created a lightweight utility that watches for spikes and prints out garbage collector information along with the call stack of the thread.


    By default, your min-elapsed-time is most likely going to be around ~500 (I may change this in the future). That means that if a tick takes over 500 milliseconds to complete, ClearLag will print various details about the thread that will help you figure out what exactly is causing the issue.
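
   If you're curious how this kind of watchdog works under the hood, here's a minimal sketch of the general technique (this is NOT ClearLag's actual code - the class and method names are made up for illustration). A separate timer thread periodically checks how long the current tick has been running and, once it passes the threshold, dumps the server thread's state and stack:

import java.util.Timer;
import java.util.TimerTask;

public class SpikeWatcher {

    // Updated by the server thread at the start of every tick
    private volatile long currentTick = 0;
    private volatile long tickStartMillis = System.currentTimeMillis();

    private final Thread serverThread;
    private final long minElapsedTime; // e.g. 500 ms

    public SpikeWatcher(Thread serverThread, long minElapsedTime) {
        this.serverThread = serverThread;
        this.minElapsedTime = minElapsedTime;
    }

    // Called by the server thread at the start of each tick
    public void onTickStart(long tick) {
        currentTick = tick;
        tickStartMillis = System.currentTimeMillis();
    }

    public void start() {
        new Timer("SpikeWatcher", true).scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                long elapsed = System.currentTimeMillis() - tickStartMillis;
                if (elapsed > minElapsedTime) {
                    System.out.println("Possible lag spike on tick #" + currentTick
                            + " (currently at " + elapsed + " ms). Thread state: " + serverThread.getState());
                    for (StackTraceElement element : serverThread.getStackTrace()) {
                        System.out.println(" > " + element);
                    }
                }
            }
        }, 0L, 100L); // poll every 100 ms so the watcher itself costs almost nothing
    }
}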

 

   I myself recently dealt with a major lag spike that took around 8 seconds. It ended up being caused by a MASSIVE garbage collection that was the result of too many objects being promoted to oldGen, which quickly accumulated and caused a huge stop-the-world event in a desperate attempt to free memory before the server ran out. The solution was to free temporary memory at a much faster rate and to increase my survivor space so objects would, in theory, remain in survivor for longer instead of being prematurely promoted to oldGen.
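
   If you run into the same kind of premature promotion, the usual knobs are the young-generation and survivor sizing flags. A rough example only - the actual numbers depend entirely on your heap size and allocation rate, so treat these values as placeholders to test against your own GC logs:

# Example startup flags - placeholder values, tune for your own server
java -Xms8G -Xmx8G \
     -Xmn2G \
     -XX:SurvivorRatio=6 \
     -XX:MaxTenuringThreshold=15 \
     -jar spigot.jar nogui

A larger -Xmn gives the young generation more room, a lower SurvivorRatio makes the survivor spaces bigger relative to eden, and a higher MaxTenuringThreshold keeps objects in survivor for more collections before they're promoted to oldGen.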

 

Here is what a report of that looks like (follow-stack disabled):

 

[23:43:37] [Timer-1/WARN]: [ClearLag] Clearlag has detected a possible lag spike on tick #421828 (Tick is currently at 748 milliseconds)
[23:43:37] [Timer-1/WARN]: [ClearLag] Thread name: Server thread Id: 30
[23:43:37] [Timer-1/WARN]: [ClearLag] Thread state: WAITING
[23:43:37] [Timer-1/WARN]: [ClearLag] Thread stack-trace:

--Removed stack trace. Not relevant to the issue--

[23:43:37] [Timer-1/WARN]: [ClearLag] Thread 'Server thread' is no longer stuck on tick #421828
[23:43:37] [Timer-1/WARN]: [ClearLag] Estimated time spent on tick #421828: 6920
[23:43:37] [Timer-1/WARN]: [ClearLag] Garbage collection time during tick: 6829

 

   Here you can see that I had a 6920 millisecond lag spike, but also 6829 milliseconds of garbage collection. That means the spike was caused by a major garbage collection event - which it was. These major garbage collection events simply have too many causes to list, but normally they're caused by memory leaks; at least, that's what I've noticed on most of the servers people bring to me. If you aren't able to track the issue down with ease, then I highly recommend using a profiler such as YourKit. With YourKit, you'll be able to dump the heap and use its MANY features to see the retained size of objects/classes, or even use the duplication analyzers to find other memory hogs such as string duplication.
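
   If you want to sanity-check those GC numbers yourself, the JDK exposes them through the standard management beans - no profiler required. A minimal sketch (sampling this before and after a slow tick and diffing the values gives you per-tick GC time, which is roughly the kind of number the report above is printing):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcSnapshot {
    public static void main(String[] args) {
        // Collection count and total collection time (ms) accumulated since JVM start,
        // reported per collector (young and old generation collectors show up separately).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
    }
}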

 

 

 

This output was caused by a world save (still using 1.8.8...):

 

[23:38:39] [Timer-1/WARN]: [ClearLag] Clearlag has detected a possible lag spike on tick #416000 (Tick is currently at 126 milliseconds)
[23:38:39] [Timer-1/WARN]: [ClearLag] Thread name: Server thread Id: 30
[23:38:39] [Timer-1/WARN]: [ClearLag] Thread state: RUNNABLE
[23:38:39] [Timer-1/WARN]: [ClearLag] Thread stack-trace:
[23:38:39] [Timer-1/INFO]: > net.minecraft.server.v1_8_R3.ChunkRegionLoader.a(ChunkRegionLoader.java:230)
[23:38:39] [Timer-1/INFO]: > net.minecraft.server.v1_8_R3.ChunkRegionLoader.a(ChunkRegionLoader.java:128)
[23:38:39] [Timer-1/INFO]: > net.minecraft.server.v1_8_R3.ChunkProviderServer.saveChunk(ChunkProviderServer.java:270)
[23:38:39] [Timer-1/INFO]: > net.minecraft.server.v1_8_R3.ChunkProviderServer.saveChunks(ChunkProviderServer.java:343)
[23:38:39] [Timer-1/INFO]: > net.minecraft.server.v1_8_R3.WorldServer.save(WorldServer.java:974)
[23:38:39] [Timer-1/INFO]: > org.bukkit.craftbukkit.v1_8_R3.CraftWorld.save(CraftWorld.java:776)
[23:38:39] [Timer-1/INFO]: > net.minecraft.server.v1_8_R3.MinecraftServer.A(MinecraftServer.java:746)
[23:38:39] [Timer-1/INFO]: > net.minecraft.server.v1_8_R3.MinecraftServer.run(MinecraftServer.java:624)
[23:38:39] [Timer-1/INFO]: > java.lang.Thread.run(Thread.java:748)

[23:38:39] [Timer-1/WARN]: [ClearLag] Thread 'Server thread' is no longer stuck on tick #416000
[23:38:39] [Timer-1/WARN]: [ClearLag] Estimated time spent on tick #416000: 461
[23:38:39] [Timer-1/WARN]: [ClearLag] Garbage collection time during tick: 54

 

 

 

   Here, you can clearly see by reading the stack that we're stuck on ChunkProviderServer.saveChunk. It's not abnormal at all to get spikes from saving chunks, since they're saved on the main thread to prevent concurrency issues. There's no way to fix this other than creating your own chunk-save system, or possibly editing some of the configuration options available to you in Spigot/PaperSpigot/etc. I tend to keep my save interval at about 10 minutes.
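
   For reference, the autosave interval lives in bukkit.yml and is measured in ticks (20 ticks = 1 second), so a 10 minute interval looks something like this (Paper has additional per-world save settings, but the exact keys vary by version):

ticks-per:
  autosave: 12000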

 

   Since this spike only took ~461 milliseconds, the server should be able to catch up to its original tick rate rather quickly, unless your average full tick is already incredibly high, like around 48ms. I generally recommend not letting your average full-tick exceed 35-40ms, otherwise small lag spikes like world saves will have a noticeable impact.
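
   The reason that margin matters is simple arithmetic: the server has a 50 ms budget per tick (20 TPS), and it can only pay a spike back with whatever headroom is left over each tick. A rough back-of-the-envelope sketch, ignoring the server's tick-skipping behavior (the 40 ms average here is an assumption, not a measurement from the report above):

public class CatchUpEstimate {
    public static void main(String[] args) {
        double tickBudgetMs = 50.0;  // 20 TPS target
        double avgTickMs = 40.0;     // assumed normal full-tick time
        double spikeMs = 461.0;      // the world-save spike above

        double headroomPerTick = tickBudgetMs - avgTickMs;              // 10 ms recovered per tick
        double ticksToCatchUp = spikeMs / headroomPerTick;              // ~46 ticks
        double secondsToCatchUp = ticksToCatchUp * avgTickMs / 1000.0;  // ~1.8 seconds

        System.out.printf("~%.0f ticks (~%.1f s) to recover%n", ticksToCatchUp, secondsToCatchUp);
    }
}

With a 48 ms average tick the headroom drops to 2 ms, and the same 461 ms spike takes over 200 ticks to absorb - which is why the average matters so much.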

 

So if your garbage collection time is not the cause of the spike, and you're noticing a consistent stack-trace, you should be able to figure out what's causing your spikes. Most spikes are caused by Spigot plugins doing things such as saving YAML files on the main thread (incredibly common...), massive loops, or just a struggling garbage collector.
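
   For the YAML case specifically, the usual fix is to serialize on the main thread and push the actual file write onto an async task. A hedged sketch using the Bukkit scheduler (the class and method names here are placeholders, and you still need to make sure nothing else touches the file concurrently):

import org.bukkit.Bukkit;
import org.bukkit.configuration.file.YamlConfiguration;
import org.bukkit.plugin.java.JavaPlugin;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class AsyncYamlSave {

    // Serialize on the main thread (cheap), then do the disk I/O off-thread (the slow part).
    public static void saveAsync(JavaPlugin plugin, YamlConfiguration config, File file) {
        final String data = config.saveToString();

        Bukkit.getScheduler().runTaskAsynchronously(plugin, () -> {
            try {
                Files.write(file.toPath(), data.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                plugin.getLogger().severe("Failed to save " + file.getName() + ": " + e.getMessage());
            }
        });
    }
}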

 

If this - or timings - doesn't help you resolve your lag spikes, then I highly recommend using a profiler such as YourKit. There's never been a lag/memory issue I haven't been able to solve with a profiler. It's one of the major benefits of Java... or just high-level languages in general, I suppose.

 

FYI: ClearLag's spike detector isn't going to be deadly accurate with its timings. It's designed to have a very minimal impact on the server rather than probing the main thread every millisecond, so I don't recommend using it to detect very small spikes, like a tick taking 60ms. Also keep in mind that the output stack-trace won't always be related to the issue. The trick is to find consistency among your spike detections.

