If your lovingly crafted application is running slowly after deployment, there are five common reasons it runs well on your development machine but is totally borked in production.
There are of course other reasons why your software doesn’t work well in production, but these are the top reasons I’ve seen when developers say “it runs fine on my machine” and then discover that with great volume comes great responsibility for the concerns of scale.
Cause No. 1: You have one big thread
In some modern frameworks (like Node.js), threading is handled for you. You’re supposed to make the right nonblocking I/O calls at the right time and keep your main line of code free of heavy lifting. Failing that, you can start to starve the system of actual threads.
If you have this problem, it raises some questions. The most basic: If you’re running a major algorithm in JavaScript on Node.js, are Node.js and (gasp) JavaScript the right technology to use here? If you must, you need to learn how Node.js handles concurrency and how to avoid blocking the event loop. You need to learn how to submit work to a worker pool instead. You may even have to learn (gasp!) about threads.
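The point above is about Node.js, but the same pattern can be sketched in Python's asyncio, which also runs its callbacks on a single event-loop thread. This is a minimal sketch under stated assumptions: `blocking_work`, `heartbeat`, and `run` are hypothetical names, and `time.sleep` stands in for whatever heavy call is hogging the loop. It counts how many times the loop gets a turn while the work runs, first inline and then handed to the default executor (a thread pool).

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    """Stand-in for a heavy, blocking call (hypothetical workload)."""
    time.sleep(seconds)
    return "done"


async def heartbeat(counter: list, interval: float = 0.01) -> None:
    """Increment counter[0] every time the event loop gives us a turn."""
    while True:
        counter[0] += 1
        await asyncio.sleep(interval)


async def run(offload: bool, seconds: float = 0.2) -> int:
    """Return how many heartbeats fired while the work was running."""
    counter = [0]
    beat = asyncio.create_task(heartbeat(counter))
    await asyncio.sleep(0)  # let the heartbeat task start
    before = counter[0]
    if offload:
        # Hand the work to the loop's default thread pool; the loop stays free.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, blocking_work, seconds)
    else:
        # Call it inline: nothing else on the loop can run until it returns.
        blocking_work(seconds)
    ticks = counter[0] - before
    beat.cancel()
    return ticks


if __name__ == "__main__":
    print("heartbeats while blocking inline:", asyncio.run(run(offload=False)))
    print("heartbeats while offloaded:", asyncio.run(run(offload=True)))
```

Run inline, the heartbeat never fires during the work; offloaded, it keeps ticking. In Python specifically, a thread pool keeps the loop responsive but does not speed up CPU-bound code, so a process pool is usually the better fit for real number crunching.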
Cause No. 2: Your database is !@#$
There are a lot of reasons your database can be blown. The first and most obvious is a lack of an index. If you’re on an SQL database, you should learn how indexes work. If you have a where clause with three key/value pairs in it and you run it over and over with different values, those columns should probably be indexed. For example:
select … from customer c where c.firstname = :fname and c.lastname = :lname and c.state = :state
You could have three indexes (and there might be reasons to do that). However, that still results in index merges—meaning the results of three searches have to be merged. Instead, consider an index that includes all three fields in one index. There is a caveat to all of this: Indexes do tend to slow inserts and updates.
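To see what a single three-column index looks like in practice, here is a small sketch using Python's built-in sqlite3 module. The table layout, the sample rows, and the index name `idx_customer_name_state` are assumptions made for illustration; only the shape of the where clause comes from the query above. Other databases have their own syntax for inspecting query plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical customer table; column types are assumed for the example.
conn.execute(
    "CREATE TABLE customer ("
    "id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, state TEXT)"
)

# One composite index covering all three columns used in the where clause.
conn.execute(
    "CREATE INDEX idx_customer_name_state "
    "ON customer (firstname, lastname, state)"
)

conn.executemany(
    "INSERT INTO customer (firstname, lastname, state) VALUES (?, ?, ?)",
    [("Ada", "Lovelace", "NC"), ("Ada", "Byron", "NC"), ("Alan", "Turing", "CA")],
)

query = (
    "SELECT c.id FROM customer c "
    "WHERE c.firstname = :fname AND c.lastname = :lname AND c.state = :state"
)
params = {"fname": "Ada", "lname": "Lovelace", "state": "NC"}

# Ask SQLite how it intends to run the query; the last column is the detail text.
plan_text = " ".join(
    str(row[-1]) for row in conn.execute("EXPLAIN QUERY PLAN " + query, params)
)
matches = conn.execute(query, params).fetchall()

print(plan_text)
print(matches)
```

The plan text should name the composite index rather than a full table scan. The exact wording of the plan varies between SQLite versions, so check for the index name rather than matching the whole string.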
Another common reason is a total misdesign of your schema. I was once on a large MongoDB project where the customer designed the schema like it was an RDBMS.
Nothing executed reasonably because MongoDB wasn’t designed to join tables to do simple things like look up a phone number; it was designed to have that right in the customer document as an attribute. If you have a bad schema design, your application can’t run well on it.
In modern applications, developer preference has had a lot to do with database selection, but not every application performs well on every type of database:
- If you are doing hierarchical queries or finding the relationship between two rows, you shouldn’t be on an RDBMS.
- If you’re basically reimplementing tables on top of a key-value store, just stop.
- If you have mostly friend-of-a-friend (FoaF) queries, maybe you need a graph database.
- If you’re doing a lot of queries with pattern-matching conditions like Foo% searches, use a search index like Apache Solr (yes, that’s my company’s product) instead (at least for that part).
Another less obvious reason your database is borked is that you’ve tried to open too many connections at once. For example, if you have one database connection pool locally that opens 100 connections but you’ve got 15 application servers all opening 100 connections, that’s 1,500 connections that have to be opened at once. That may not work too well. You may need to do this a little at a time. Your ops people should know about this and how to constrain how much traffic makes it to the application servers at a given time (how to start up “warm” rather than “hot”).
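The arithmetic behind that example, plus one way to think about opening connections "a little at a time," can be sketched as follows. The function names and the `max_new_connections` budget of 300 are hypothetical; a real rollout would take that number from whatever the database and the DBA say it can absorb at once.

```python
def total_connections(app_servers: int, pool_size: int) -> int:
    """Connections opened if every server fills its pool at the same time."""
    return app_servers * pool_size


def warm_start_batches(app_servers: int, pool_size: int,
                       max_new_connections: int) -> list:
    """Group servers into start-up batches so that each batch opens at most
    max_new_connections (an assumed budget) new connections."""
    per_batch = max(1, max_new_connections // pool_size)
    servers = list(range(1, app_servers + 1))
    return [servers[i:i + per_batch] for i in range(0, app_servers, per_batch)]


if __name__ == "__main__":
    print(total_connections(15, 100))        # all 15 servers starting "hot"
    for batch in warm_start_batches(15, 100, max_new_connections=300):
        print("start servers", batch)
```

With 15 servers, pools of 100, and a budget of 300 new connections per step, the servers start in five batches of three instead of all at once.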
If the performance issue is the database, you should be able to find it with database monitoring tools, by logging query return times, or by listening to the DBA who told you the database couldn’t handle all of those connections.
Know the database you’re writing to and what it likes in terms of schema and practices. Pick the appropriate database or databases for the job.
Cause No. 3: You didn’t size memory correctly
Most modern business software runs on some sort of stack-based virtual machine. I’m not talking about VMware or Docker, but something more like the Java Virtual Machine (JVM). Without getting into much detail on the inner workings of VMs, nearly all of them require you to dedicate a certain amount of memory called a heap. They also use other types of memory every time they launch a thread. If they run low on heap memory, they’ll spend a lot more time on memory management, which will look (until they crash and burn) like the application just got slow.
On the JVM, you can turn on garbage-collection logging, which will show you how many collections are being run. You can also just up the heap size, but do that judiciously.
Many people think the heap is the only kind of memory, but there is also the JVM’s -Xss stack size option. Each thread gets a certain amount of stack memory. If
System Memory - (heap + otherstuff + (numthreads * stacksize)) <= 0
then when you grab another thread, you’ll throw a special kind of out of memory exception. Depending on the kinds of libraries you’re running, that might be a thread pool that doesn’t expand or a database connection pool that doesn’t expand—both will look like a slowdown. The good news is this is captured in any log.
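The inequality above is easy to turn into a quick back-of-the-envelope check. This is a simplified sketch: the function names and the sample figures (8 GiB of system memory, a 4 GiB heap, 1 GiB of other usage, 1 MiB of stack per thread) are assumptions for illustration, and a real JVM's memory accounting is more involved than one subtraction.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def memory_headroom(system: int, heap: int, other: int,
                    threads: int, stack_per_thread: int) -> int:
    """Bytes left after the heap, other usage, and all thread stacks."""
    return system - (heap + other + threads * stack_per_thread)


def max_threads(system: int, heap: int, other: int,
                stack_per_thread: int) -> int:
    """Largest thread count that still fits under the formula above."""
    return max(0, (system - heap - other) // stack_per_thread)


if __name__ == "__main__":
    limit = max_threads(8 * GIB, 4 * GIB, 1 * GIB, 1 * MIB)
    print("threads that fit:", limit)
    print("headroom at limit:",
          memory_headroom(8 * GIB, 4 * GIB, 1 * GIB, limit, 1 * MIB))
    print("headroom one over:",
          memory_headroom(8 * GIB, 4 * GIB, 1 * GIB, limit + 1, 1 * MIB))
```

Raising the heap in this model directly lowers the number of threads that fit, which is one reason to raise it judiciously.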
Cause No. 4: You sized your thread or connection pools incorrectly
If you’ve got 1,000 concurrent users and five database connections in the pool, you’ve probably got requests waiting on that pool. If you’ve got 100 HTTP threads on top of that and a TCP backlog setting of 5, after 105 people try to connect, you’re going to see a “connection refused” message, but things will get really slow before then. In addition, some software has a number of “accept” threads, which basically “answer the phone and hand it off to one of those other threads.” Usually there is one, maybe two, of those.
There is no hard-and-fast rule for what those numbers should all be, but be reasonable on the proportions. Also remember cause No. 3 while doing this because you can run into other constraints.
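To get a feel for the proportions, the numbers in this section can be plugged into a rough model. Both functions are hypothetical helpers, and the model is deliberately crude: it assumes every request holds a connection for the same fixed time (50 ms here, an invented figure) and that requests are served in rounds of `pool_size`. Real backlog handling also varies by operating system.

```python
import math


def connection_capacity(http_threads: int, tcp_backlog: int) -> int:
    """Clients that can be served or queued before new ones are refused."""
    return http_threads + tcp_backlog


def worst_case_pool_wait(requests: int, pool_size: int,
                         hold_seconds: float) -> float:
    """Seconds the last request waits for a connection if all requests
    arrive together and each holds a connection for hold_seconds."""
    rounds = math.ceil(requests / pool_size)
    return (rounds - 1) * hold_seconds


if __name__ == "__main__":
    print(connection_capacity(100, 5))
    print(worst_case_pool_wait(1000, 5, 0.05))
```

Under these assumptions, the unluckiest of 1,000 simultaneous requests waits roughly 10 seconds for one of five connections, even though each query takes only 50 ms.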
Cause No. 5: You haven’t set your limits and file handles correctly
Most operating systems have limits on the number of threads and files a user is allowed to open. If you run at this limit, things get slow before they fail. You should see this in the log, and there are tools to show what file locks and handles are in use.
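On Unix-like systems, a process can inspect its own file-handle limit from code. The sketch below uses Python's standard `resource` module, which is not available on Windows, and counts open handles through `/proc/self/fd`, which exists on Linux but not everywhere. The helper names and the 80 percent warning threshold are assumptions for illustration.

```python
import os
import resource


def file_handle_limits() -> tuple:
    """Return the (soft, hard) limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_handle_count():
    """Count this process's open descriptors, or None if /proc is missing."""
    fd_dir = "/proc/self/fd"
    if os.path.isdir(fd_dir):
        return len(os.listdir(fd_dir))
    return None


def near_limit(used: int, soft_limit: int, threshold: float = 0.8) -> bool:
    """True when usage has reached the given fraction of the soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return False
    return used >= soft_limit * threshold


if __name__ == "__main__":
    soft, hard = file_handle_limits()
    used = open_handle_count()
    print("soft limit:", soft, "hard limit:", hard, "open now:", used)
    if used is not None and near_limit(used, soft):
        print("warning: close to the file-handle limit")
```

A check like this, logged at startup or exposed as a metric, makes it easier to tell a limits problem apart from the other causes in this article.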