One of our clients uses AWS Greengrass for local data collection and processing. I was recently involved in an incident investigation for them: a significant number of MQTT messages that were expected to reach a local component were being dropped or delayed, causing major problems. I'd like to share a few words on the root cause, as it will likely help others facing a similar issue. As of today, this behaviour isn't documented in the AWS docs at all!
Identifying the problem
I’m not going to explain the whole investigation process as it was quite long and we were checking multiple angles (also outside of Greengrass). Let me get straight to the point.
The turning point was finding lots of DEBUG entries like the following in the greengrass.log file:
{"thread":"nioEventLoopGroup-3-2","level":"DEBUG","eventType":null,"message":"Read suspend: 26:false:false","contexts":{},"loggerName":"io.netty.handler.traffic.AbstractTrafficShapingHandler","timestamp":1740294328434,"cause":null}
{"thread":"nioEventLoopGroup-3-1","level":"DEBUG","eventType":null,"message":"Write suspend: 16:true:true","contexts":{},"loggerName":"io.netty.handler.traffic.AbstractTrafficShapingHandler","timestamp":1740294329114,"cause":null}
First things first – the loggerName is io.netty.handler.traffic.AbstractTrafficShapingHandler. What is this? Netty is a Java framework that simplifies building high-performance network applications, and AbstractTrafficShapingHandler is the base class of Netty's traffic shaping mechanism. In plain terms, it controls and throttles the flow of data over network channels: it monitors read and write operations and can temporarily suspend them (as you can see in our logs) when the data rate exceeds the configured limits. This helps prevent network congestion and keeps throughput stable.
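To make the mechanism concrete, here is a minimal, illustrative Netty server – not Moquette's actual code – that installs a per-channel traffic shaping handler with the same 524,288 bytes/s limits we ran into. The port and the handler placement are assumptions made purely for the sketch:

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.traffic.ChannelTrafficShapingHandler;

public class TrafficShapingExample {
    public static void main(String[] args) throws InterruptedException {
        // Limits mirroring the defaults we found: 524288 bytes per second each way.
        final long writeLimit = 524_288;
        final long readLimit  = 524_288;

        NioEventLoopGroup boss = new NioEventLoopGroup(1);
        NioEventLoopGroup workers = new NioEventLoopGroup();
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                .group(boss, workers)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // Per-channel traffic shaping: once a channel exceeds either limit,
                        // the handler suspends reads (autoRead = false) or delays writes,
                        // emitting the "Read suspend" / "Write suspend" DEBUG messages.
                        ch.pipeline().addFirst(
                            new ChannelTrafficShapingHandler(writeLimit, readLimit));
                        // ... protocol handlers (e.g. an MQTT codec) would follow here ...
                    }
                });
            bootstrap.bind(1883).sync().channel().closeFuture().sync(); // 1883 chosen for illustration
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}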
By now, you probably realise what happened: Netty had been suspending MQTT traffic that exceeded the defined limits. We then found the following settings in effectiveConfig.yaml, under the configuration of the aws.greengrass.clientdevices.mqtt.Moquette component:
netty.channel.read.limit: 524288
netty.channel.write.limit: 524288
These appear to be the defaults – we never set them ourselves. They mean that Netty allows roughly 512 KB (524,288 bytes) of data to be read or written per second and suspends anything above that. In our case, throughput was much higher, so lots of messages were dropped or delayed while Netty kept suspending the traffic.
The fix
The first thought is always the best, right? We revised the Greengrass deployment and updated the configuration of the Moquette component:
{
  "reset": [],
  "merge": {
    "netty.channel.read.limit": 2097152,
    "netty.channel.write.limit": 2097152
  }
}
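We pushed this through a revised deployment. For reference, a minimal sketch of the same configuration update done programmatically with the AWS SDK for Java v2 might look like the following – the target ARN and component version are placeholders, and your deployment tooling may differ:

import java.util.Map;

import software.amazon.awssdk.services.greengrassv2.GreengrassV2Client;
import software.amazon.awssdk.services.greengrassv2.model.ComponentConfigurationUpdate;
import software.amazon.awssdk.services.greengrassv2.model.ComponentDeploymentSpecification;
import software.amazon.awssdk.services.greengrassv2.model.CreateDeploymentRequest;

public class RaiseMoquetteLimits {
    public static void main(String[] args) {
        // Placeholder target ARN; use your own thing group or core device ARN.
        String targetArn = "arn:aws:iot:eu-west-1:123456789012:thinggroup/my-greengrass-group";

        // The same merge document as above, passed as a JSON string.
        String merge = "{\"netty.channel.read.limit\":2097152,"
                     + "\"netty.channel.write.limit\":2097152}";

        try (GreengrassV2Client client = GreengrassV2Client.create()) {
            client.createDeployment(CreateDeploymentRequest.builder()
                .targetArn(targetArn)
                .deploymentName("raise-moquette-netty-limits")
                .components(Map.of(
                    "aws.greengrass.clientdevices.mqtt.Moquette",
                    ComponentDeploymentSpecification.builder()
                        .componentVersion("2.3.2") // placeholder: keep the version you already deploy
                        .configurationUpdate(ComponentConfigurationUpdate.builder()
                            .merge(merge)
                            .build())
                        .build()))
                .build());
        }
    }
}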
This can of course lead to increased CPU usage, but our server was strong enough to handle the extra load without issues.
After deploying the above change, the problems vanished.
Concern about Greengrass docs
One of the key takeaways is that these settings should definitely be mentioned in the AWS Greengrass documentation. That would have saved us, and probably many other people struggling with the same issue, a lot of time. Of course we will act on it – one of our team members has already volunteered to raise a ticket asking the Greengrass team to update the documentation.
Hopefully this piece of knowledge saves you some time.