Why Is My App Slow? Part 2 - Increased Load
I go on to describe the typical issues that cause an app to function at an unforgivably slow pace, based on projects I've been directly involved in.
In my previous article, I described the first common scenario that developers encounter when they are asked to speed up an app. It was the story of a website created by engineers scattered around the globe. They didn't coordinate their actions and preferred the easiest path, mindlessly copying code without caring for the system's performance.
Here comes the second story about speed issues in the same project. This time it was not so simple. The application was well maintained and we had gained lots of new users. Three years later, when the number of users had doubled, we started experiencing speed issues at peak times.
Search to the Bottom
So we had to slow down feature development and dedicate some time to performance optimization. This time it required much more skill and a wider set of techniques. It was not just about restructuring the code. Rather, it was about looking for bottlenecks that showed up during our night, when the development team was asleep.
We took several steps to get to the truth.
First, we set up various metrics in AWS to see what was causing the overload during peak hours. After collecting the metrics we checked specific hours. Very quickly we found our first suspect - the database.
Just like in the first story, upgrading the server instance did not help much. The system started working more smoothly, but the problems did not go away.
The next step was looking for queries that were causing a high load. AWS provided a feature called ‘Performance Insights’ that was very helpful. It showed so-called ‘Top Waits’ that included top SQL queries causing the system to wait until the data was fetched.
These queries were what we had to improve.
A query could end up on that list for several reasons. One was that the query itself was slow. Another was that the query was fast enough on its own but was called many times within a short period.
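To make the two cases concrete, here is a minimal sketch in Python. It assumes you have per-query call counts and average execution times for some time window; the query names and numbers are made up for illustration and do not come from our project. Ranking by total time (calls multiplied by mean time) shows how a fast but frequent query can outweigh a slow but rare one.

```python
from dataclasses import dataclass


@dataclass
class QueryStat:
    name: str       # hypothetical label for the query
    calls: int      # how many times it ran in the sampled window
    mean_ms: float  # average execution time per call, in milliseconds

    @property
    def total_ms(self) -> float:
        # Total time the database spent on this query in the window.
        return self.calls * self.mean_ms


def rank_by_total_time(stats):
    """Return the stats sorted by total time spent, heaviest first."""
    return sorted(stats, key=lambda s: s.total_ms, reverse=True)


# Made-up numbers, for illustration only.
stats = [
    QueryStat("nightly_report", calls=2, mean_ms=4000.0),
    QueryStat("session_lookup", calls=50_000, mean_ms=3.0),
    QueryStat("product_search", calls=800, mean_ms=45.0),
]

for s in rank_by_total_time(stats):
    print(f"{s.name}: {s.total_ms:.0f} ms total")
```

In this made-up sample the 3 ms lookup ends up at the top of the ranking, simply because it runs so often.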
The list contained a lot of ‘noise’ - Metabase requests, API session requests from the mobile apps, etc. - queries that were not supposed to be ‘waits’, so we closed a few gaps just by getting the noise out of the way: Metabase was disabled, and some other queries were improved to the point where they no longer showed up on the list. However, these queries were not the actual problem.
The problem was in high peaks caused by frequent queries that were not supposed to be slow - but they were. You would not notice a single such request, but if five or six arrived within the same second, they would form a queue and wait for their turn to be processed.
We caught three or four problematic queries. Once we had them, the next step was to investigate their plans through ‘EXPLAIN ANALYZE’. I learned a lot about Postgres while doing this.
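The queueing effect can be sketched with a toy simulation. This is a simplified model, not how Postgres actually schedules work: it assumes a fixed number of workers, first-come-first-served order, and the same execution time for every request. The function name and the 200 ms figure are hypothetical.

```python
import heapq


def latencies(arrivals_ms, service_ms, workers=1):
    """Simulate first-come-first-served processing.

    Returns, for each request in arrival order, the time between its
    arrival and its completion (waiting time plus execution time).
    """
    free_at = [0.0] * workers  # moment at which each worker becomes free
    heapq.heapify(free_at)
    result = []
    for arrival in sorted(arrivals_ms):
        worker_free = heapq.heappop(free_at)  # earliest available worker
        start = max(arrival, worker_free)     # wait if the worker is busy
        finish = start + service_ms
        heapq.heappush(free_at, finish)
        result.append(finish - arrival)
    return result


# Six requests of 200 ms each, one per second: nobody waits.
spread_out = latencies([0, 1000, 2000, 3000, 4000, 5000], service_ms=200)

# The same six requests arriving at the same moment: they queue up.
burst = latencies([0, 0, 0, 0, 0, 0], service_ms=200)

print(spread_out)
print(burst)
```

With a single worker, the last request in the burst waits for the five in front of it, so a query that takes 200 ms in isolation is seen by the user as taking more than a second.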
Some problems were related to missing indexes, others to poor database design that resulted in overcomplicated queries. Some could be solved through better SQL conditions, while others required the query to be decomposed into several smaller ones. The whole refactoring took some time.
What I learned from this is that there is a point at which a superficial glance is no longer sufficient. One needs to look closer at implementation details, learn more about ways to get better performance from the tools used, and search to the bottom.
The interesting thing was that we managed to resolve all issues without adding any complexity to the infrastructure - no new tools like Elasticsearch or the like. All problems could be solved just by making the code better. There was a lot of room for improvement even though the project had been running for years and was well written.
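As an illustration of the missing-index case, here is a small sketch that compares a query plan before and after adding an index. Note the substitution: it uses SQLite from Python's standard library and its ‘EXPLAIN QUERY PLAN’ statement as a stand-in, because that runs without a database server. Postgres's ‘EXPLAIN ANALYZE’ produces different and much more detailed output, and the table, column, and index names here are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the plan descriptions SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column is the description


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

sql = "SELECT total FROM orders WHERE user_id = ?"

# Without an index on user_id the planner has to scan the whole table.
plan_before = query_plan(conn, sql, (42,))

conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")

# With the index in place it can look up matching rows directly.
plan_after = query_plan(conn, sql, (42,))

print(plan_before)
print(plan_after)
```

The exact wording of the plan depends on the SQLite version, but the first plan should report a scan of the table and the second a search using the new index. The same before-and-after comparison is what we were doing with the Postgres plans.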
Thus, the second usual cause for apps slowing down as time goes by is the escalating load due to an expanding customer base. And the solution, like in the first scenario, isn't about new hardware or powerful technologies, but again about the developers' qualifications and experience.
In my next article, I'll tell yet another story about yet another common cause of application sluggishness - its inherent complexity and massive data processing. What tactic to choose in this case? Stay tuned to this series, I'll reveal it all.