Why Is My App Slow? Part 1 - Quality Issues
Topics: Common Mistakes, Thoughts
If your web application, or your mobile app with a backend, is slow, let me share three stories from projects I was involved in.
These are stories of actual projects, showing three typical scenarios where apps start to slow down. We'll see how and why these situations arise and think about the best ways to overcome them. I opted to create a separate article for every story, beginning with this one. It's about a returning customer who had a project developed by our company.
Same Project After Three Years
I was not involved in developing or launching the first release of this application, but I had heard about it. Our developers created the original project design and architecture and helped the customer go live.
It all took place in 2017 and then we parted ways for three years. The customer chose other vendors that were cheaper, or closer, or more suitable in another way.
Then he returned in 2020 and asked for help because the web application was very slow. After certain user actions it would hang for 10 minutes - nobody could do anything in the browser. Then the application would go back to "normal" - I mean, it would work again till the next user action.
The project owner was putting a lot of effort into keeping his clients in the system, but the app's performance did not give him a chance to get more business.
That was the point at which I stepped in. The project did not look like a long-term collaboration. I was going to apply a couple of patches, wrap up this work and switch to something else. First, I updated the server software to the latest versions, expecting a boost in efficiency, and arranged for regular restarts to eliminate temporary glitches. But this didn't help. So, I reluctantly took on the labor-intensive task of reviewing the backend code.
I found a lot of surprises!
The main surprise I found was that the backbone of the software was still ours - all the architectural solutions were in place. But the way they were "enhanced" by our successors…
Imagine that there is a web page which makes a query to the database. For instance, it gets a list of companies and shows it in a table. Also imagine that there is a software developer whose assignment is to enhance the behavior by adding more information such as company managers, work days, locations, etc.
What would such a developer do? You may have guessed it - he would take the already written query and copy it multiple times. Sometimes he would even nest a query inside another query.
That would solve the problem. But at the same time, the number of database queries would multiply: every nested query runs once for each row of the outer query, so the total grows with the product of the row counts instead of staying constant.
That was what I actually saw. Time-consuming pieces of code were multiplied everywhere because of someone's copy-paste. Those people were focused on their own assignments and did not notice the point where the architecture had to change.
Even if they noticed it, they did not know how to move away from the copy-paste approach and write faster code.
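To make the difference concrete, here is a minimal sketch of the nested-query pattern next to a single joined query. It is not the project's code: the companies and managers tables, the function names and the use of an in-memory SQLite database are all assumptions made for illustration.

```python
import sqlite3


def make_db(n_companies=50):
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE managers (
            id INTEGER PRIMARY KEY, company_id INTEGER, name TEXT
        );
        """
    )
    ids = range(1, n_companies + 1)
    conn.executemany(
        "INSERT INTO companies (id, name) VALUES (?, ?)",
        [(i, f"Company {i}") for i in ids],
    )
    conn.executemany(
        "INSERT INTO managers (company_id, name) VALUES (?, ?)",
        [(i, f"Manager {i}") for i in ids],
    )
    conn.commit()
    return conn


def load_nested(conn):
    """Copy-paste style: one extra query for every company row."""
    queries = 1
    companies = conn.execute(
        "SELECT id, name FROM companies ORDER BY id"
    ).fetchall()
    rows = []
    for company_id, company_name in companies:
        manager = conn.execute(
            "SELECT name FROM managers WHERE company_id = ?", (company_id,)
        ).fetchone()
        queries += 1
        rows.append((company_name, manager[0]))
    return rows, queries


def load_joined(conn):
    """Same result with a single query, whatever the number of rows."""
    rows = conn.execute(
        """
        SELECT c.name, m.name
        FROM companies AS c
        JOIN managers AS m ON m.company_id = c.id
        ORDER BY c.id
        """
    ).fetchall()
    return rows, 1


if __name__ == "__main__":
    conn = make_db()
    nested_rows, nested_queries = load_nested(conn)
    joined_rows, joined_queries = load_joined(conn)
    print(nested_queries, joined_queries, nested_rows == joined_rows)
```

Both functions return the same table data. The nested version issues one query per company on top of the initial one, and every further copied field (work days, locations) adds another query per row, while the joined version stays at a single query.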
Consequences of Code Multiplying
The same style was applied to the background jobs. The original code had several jobs scheduled for background processing - after three years, several had turned into tens of thousands.
I am not exaggerating. Someone’s assignment was to make the system send reminders. If there was an event, the system had to send a reminder about it, say, three times. If there was a weekly event, the system had to send three reminders each week.
The reminders were supposed to be marked as read after they were seen by the users. If the user did not read them, they had to expire automatically.
It was a reasonable assignment, one that deserved its own background job. This is what I would do - write a job that would run once per day and check which reminders should be sent and which should be marked as expired.
But this was not what was done.
The developer inherited an already-written background job sending one single reminder. What did he do with it? Of course, he reused it - multiple times. When an event was created, it would put 3 jobs in the queue (each for one reminder) plus one more job for marking the previous reminders as expired weeks later.
So the system queued 4 jobs each time a simple, non-recurring event was created.
What would it do with a recurring event? For a recurring event that had a start time but no end time, they would queue - take a deep breath - 4 jobs per week for the next 10 years. At 52 weeks per year, that makes roughly 2,080 jobs in the queue per recurring event.
What if you had 10 recurring events? You would have more than 20,000 jobs pushed to the background processor’s queue.
Pushing jobs to the background processor - an expensive operation - was what caused the 10 minutes of downtime. When managers created a recurring event, they had to wait while the jobs were being scheduled. Actually, all users had to wait until the jobs were all put in the scheduler - not just the managers.
From then on, the jobs would sit there (in the Redis database) waiting for their time to wake up - even if that was going to happen in 10 years.
That was insane. The customer thought that the project backend technology had hit its natural limitations.
It was an easy job for me. It is always a pleasure to see the “wow” effect after solving a relatively simple problem. I just needed to delete the future jobs from the queue (who would need them in 10 years anyway?) and rewrite the logic to fit in a single job that runs once per day.
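The single daily job could look roughly like the sketch below. It is an illustration under stated assumptions, not the code that went into the project: the Event and Reminder classes, the reminder offsets of 3, 2 and 1 days before an occurrence, and the 14-day expiry window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple

REMINDER_OFFSETS = (3, 2, 1)  # assumed: remind 3, 2 and 1 days before
EXPIRE_AFTER_DAYS = 14        # assumed: unread reminders expire after 2 weeks


@dataclass
class Event:
    name: str
    start: date
    weekly: bool = False
    end: Optional[date] = None  # None means "repeats with no end date"


@dataclass
class Reminder:
    event_name: str
    sent_on: date
    read: bool = False
    expired: bool = False


def next_occurrence(event: Event, today: date) -> Optional[date]:
    """Return the first occurrence on or after today, or None."""
    if today <= event.start:
        occurrence = event.start
    elif not event.weekly:
        return None
    else:
        days_since_start = (today - event.start).days
        occurrence = today + timedelta(days=(-days_since_start) % 7)
    if event.end is not None and occurrence > event.end:
        return None
    return occurrence


def run_daily_job(
    events: List[Event], reminders: List[Reminder], today: date
) -> Tuple[int, int]:
    """Send the reminders due today and expire stale unread ones."""
    sent = 0
    for event in events:
        occurrence = next_occurrence(event, today)
        if occurrence and (occurrence - today).days in REMINDER_OFFSETS:
            reminders.append(Reminder(event.name, sent_on=today))
            sent += 1
    expired = 0
    for reminder in reminders:
        age = (today - reminder.sent_on).days
        if not reminder.read and not reminder.expired and age >= EXPIRE_AFTER_DAYS:
            reminder.expired = True
            expired += 1
    return sent, expired


if __name__ == "__main__":
    events = [Event("Weekly sync", date(2020, 1, 6), weekly=True)]
    print(run_daily_job(events, [], date(2020, 3, 1)))
```

The point of this design is that nothing is queued in advance. An open-ended recurring event costs one cheap date calculation per day instead of years of pre-scheduled jobs, and creating an event no longer blocks anyone.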
One might assume these engineers were unqualified or under time pressure due to low rates, but no. Many developers from various places and different countries contributed to the problem. They all took the same copy-paste approach without taking responsibility for making the project functional.
What they all had in common was that they cared only about their own assignment. As soon as the assignment was done and the test case worked on a demo server on a single event, they would close the ticket. Who cared how it would perform in a live system with thousands of recurring events?
Be careful whom you work with!
What we can learn from this is that some problems come from blindly copying existing solutions instead of scaling them. Every time you add something to the code, it is worth weighing how much pressure it puts on the existing architecture. Quite often, some additional work is required to keep the project's performance high.
The story with thousands of queries and background jobs had a happy ending - the customer got more business as he wanted and was happy.
In the next article, I'll share the sequel to this story - how performance issues resurfaced in the same project. This time, it wasn't about low-quality code, but about the application struggling under the increased user load. This is the second common reason for an app's slowdown. What to do in such cases? Read the next article in this series to find out.