Application Design

Database Performance for Non-Databasey Devs

I had a junior dev (primarily front-end web) work on a site for me that included a couple of calls to the underlying database.  Upon reviewing the work I decided to make a change to the way some of the DB calls were handled.  Below is the explanation of why I made those changes…

So I’ve been bad at staying in touch (as usual) and was supposed to give you a rundown on why some changes to the DB reads for the comments make such a big difference in performance.  Rather than waiting for us both to find time to do something not really required, I figured I’d at least give you a rundown via email…

(note: these apply to pretty much all relational database systems: MySQL, SQL Server, PostgreSQL, SQLite, etc.)

A good relational database engine is one of the most impressive components of any computer system I’ve ever seen.  The amount of work it goes through (potentially millions of rows of information to process) and the fact that it can still get your answer back in a fraction of a second is phenomenal, in my book.

A few foundation points to remember when working with data:

  • Always think about where the data is coming from.  When you’re SELECTing from somewhere, think about what tables are involved and the scale of those tables.  Also consider what indexes are available, and whether the columns you’re filtering and/or sorting on are indexed.  Whatever’s in the WHERE, ORDER BY or JOIN/ON is what the database engine uses to search out and define which records will actually be returned/joined; the rest of the columns in the SELECT list aren’t resolved until *after* the engine has identified the required rows via WHERE/JOIN/ORDER, so the SELECT list itself is more or less irrelevant for performance purposes.  Make it a point to have a list of indexes handy for a given database if you’re going to be reading from it a lot and you don’t have someone managing the data access for you.  (See the index sketch just after this list.)
  • Disk reads are extremely slow.  Memory (cached) reads are extremely fast.  Most databases do not store data in memory (it’s an option, but quite advanced and not relevant for this discussion).  Some databases store all tables in a single file (SQL Server, for example, uses two files: one for the data, one as a transaction log), whereas MySQL stores a separate file for each table (I think…).  Plan on every new SELECT statement having to perform a read from the disk.  It won’t always be the case, but it happens often enough that you should plan on it.  Disk reads are something like 80% of the time it takes for the entire process: it’s significant.
  • Establishing connections is slow.  If the application/server setup doesn’t support connection pooling (i.e., keeping a cache of open connections to the database that can be reused as needed), then each and every call to the database needs a new connection to be created.  There are various steps involved in this (and some are bypassed based on certain cached connection behavior, depending on the platform/engine), but generally speaking the slow part is that the core DB engine needs to be found, then the authentication handshake needs to be performed (allowing the request into the database itself), then the permission credentials of the requester need to be checked against the objects being queried once the request is “inside” the database.  Only then can the database handle the request and return the records.  (This description is very distilled and the process varies a lot between engines: MySQL is very rudimentary compared to SQL Server in this respect (SQL Server has an excellent security model, but that also means complexity).)  In rough terms, let’s figure that if a disk read can consume 80% of the round-trip time, this connection establishment and authentication can consume 17% on top of that.  (See the connection sketch just after this list.)
  • Multiple requests are slower than a single request.  If you can get the information in a single SELECT query, you’re typically better off than if you were to use two separate queries.  (For each query, at runtime, the database engine will generate an execution plan based on the requirements, table sizes, tons of factors really; the point here is that there’s some overhead for each new query.  Avoid it where reasonable.  Most times you’ll just do what makes sense from an application programming point of view in regards to this point, but it’s worth mentioning anyway.)
  • The real power of a database engine shines when it can use set-based processing (as opposed to procedural processing).  This tends to be a sticking point for most people, because almost all other programming (be it procedural, OOP, whatever) is *not* set-based.  Procedural processing in a db is referred to as RBAR: Row By Agonizing Row.  It’s insanely slow compared to set-based processing.  I’ve used a crate of blueberries as an analogy before: if you have a crate of 1000 blueberries and you need to move it from one bench to another, set-based processing is like picking up the crate by the handles and moving it to the other bench.  Non-set-based is like picking up a single blueberry and putting it on the other bench, x1000.  I won’t go into too much detail here about set-based vs. not, but generally speaking if you stick to basic commands/statements in your SQL, you’re typically set-based.  Procedural processing tends to come in when running stored procedures and calling custom functions.  A full dissertation on this topic is well beyond the scope of this email.
  • Think about the process required for each call to the database you make.  Will it have to read from a disk?  Will it have to authenticate my request and resolve permissions?  Will it have to generate multiple execution plans in order to get me all of the information I need to ask for?  Am I telling it to discern rows and join on key columns that have indexes (think library Dewey Decimal System), or will it have to look at Every. Single. Row.
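
To make the index point concrete, here’s a rough sketch against the comments table we’ll meet later in this post.  The index name and exact definition here are purely illustrative (I’m not claiming this is how the site’s indexes are actually set up), but the idea holds:

-- hypothetical index covering the columns we filter/join on
CREATE INDEX ix_comments_content ON content_comments (content_id, content_category);

-- this WHERE clause hits the indexed columns, so the engine can seek straight to the matching rows
SELECT comment_content, comment_date
FROM content_comments
WHERE content_id = 42 AND content_category = 3;

-- this one filters on a (presumably) unindexed column, and the leading wildcard rules out
-- index use anyway, so the engine has to inspect every single row in the table
SELECT comment_content, comment_date
FROM content_comments
WHERE comment_content LIKE '%blueberries%';

(MySQL’s EXPLAIN, stuck in front of a SELECT, will tell you whether an index is actually being used.)

And to put the connection point in PHP terms: the site goes through our DatabaseFactory wrapper, but if you were talking to MySQL through plain PDO, a persistent connection is a poor man’s version of pooling, letting PHP reuse an already-open, already-authenticated connection rather than paying the handshake on every page hit.  This is a sketch only, with placeholder credentials:

// sketch only: the DSN/credentials are placeholders, and the real site goes through
// dymeng\DatabaseFactory rather than raw PDO
$pdo = new PDO(
    "mysql:host=localhost;dbname=example_db;charset=utf8",
    "example_user",
    "example_password",
    array(PDO::ATTR_PERSISTENT => true) // reuse an open connection when one is available
);

It’s not true pooling (each PHP worker holds onto its own connection), but it skips the connect-and-authenticate steps on most requests, which is exactly the overhead we’re worried about.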

Ok, so there’s the basics.  It’s a lot to keep in mind, but more or less just remember that there is a specific set of things that happen: the engine doesn’t just magically return values when you ask for them.  By keeping in mind the general idea of the process and some of those key points, you’ve increased your performance awareness many times over.  Let’s look at some examples.

The task was to come up with a count of comments for each post from within PHP.  A list of the posts was already generated and entered into a loop via PHP so the posts could be echo’d back to the browser, meaning we’d already had to consult the database at some point to get that information.

  • Get list of posts from database
  • Enter loop to generate markup for each post

Now, to get the comment counts, you created a call to the database from within the loop:

  • Get list of posts from database
  • Enter loop to generate markup for each post
  • Make another call to the db to get the count of comments

Keeping in mind the list above, we can see that a number of key points have been violated: disk reads, connection establishment and query execution processing are running over and over again for each iteration of the original results.  That’s a TON of overhead, x50 (assuming the post batch returns 50 records per call), and the page doesn’t even start to display until *after* all that is done.  (If it were AJAX it’d be a bit less of an impact on the UI end, though still a lot of unnecessary abuse of the database engine: imagine a couple thousand people visiting the site at once and pulling up the main feed, so take all that overhead x50 x1000…)

Essentially, by placing a separate database call inside the loop, you’ve created a procedural-style use of the database instead of a set-based one… for each record returned, the DB has to go find other records based on it, but it can’t use its internal processing prowess to do so because these calls are separated at the application level.
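
In sketch form, the shape of the problem was something like this (simplified; renderPost is just a hypothetical stand-in for the markup generation, and the real code for both pieces follows below):

// simplified sketch of the original approach: one extra database call per post
$posts = $db->Select("SELECT ... FROM content_feed_items WHERE feeditem_status = 1 ...");  // one round trip
foreach ($posts as $post) {
    // every iteration is its own connection + execution plan + (likely) disk read,
    // so 50 posts means 50 extra round trips before the page can render
    $count = getPostComments($post["id"], $post["feeditem_category"]);
    echo renderPost($post, $count);  // hypothetical rendering helper
}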

What do we really need to do?  For each post, count the number of comments.

SQL can do that.  So if we add it as part of the original call (which is returning the specific posts anyway), we can get away from that x50 (potentially x50 x1000) of extra work, merge it into an existing call, minimize disk reads, and pull it all through one request, all by adding one join to the comments table and one aggregate (COUNT).  MySQL handles this addition in fractions of a second longer than the original query took, whereas before, each individual call from within the loop might have taken close to a second on its own, meaning that for 50 posts you’d be staring at the screen for a long time before the page showed up.  Now we’re able to get *all* of that information (the individual posts as well as the count of comments for each post) in around 1 second or so.

Here’s a generic SQL statement for getting a count of a related table:

SELECT parent.ID, COUNT(child.ID)
FROM parent
INNER JOIN child ON parent.ID = child.ParentID
GROUP BY parent.ID;

Note that only two tables are present: parent and child.  Further note that the join columns (the key ones used to discern how records will be further processed and ultimately returned) are ID columns that are likely to be indexed.  There’s no fluff: no unnecessary tables, no unneeded joins, no filtering/sorting/joining on non-indexed columns.
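
One caveat on that generic version: with an INNER JOIN, parents that have no children drop out of the results entirely.  If you want childless parents to come back with a count of 0 (which is exactly what we want for posts with no comments), swap in a LEFT JOIN and count a column from the child table:

SELECT parent.ID, COUNT(child.ID)
FROM parent
LEFT JOIN child ON parent.ID = child.ParentID
GROUP BY parent.ID;

COUNT(child.ID) ignores the NULLs the LEFT JOIN produces for childless parents, so those rows come back as 0.  That’s the form the feed query further down uses.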

Let’s take a look at the call that was set up to run from each iteration of the post generation loop:

function getPostComments($postID, $categoryID) {
    // not used anymore, included the comment count as an aggregate of the main feeditem rather
    // than calling a new database read for each item
    $html = "";
    $db = dymeng\DatabaseFactory::GetDatabase();
    $result = $db->Select("
        SELECT c.id,
               c.author_id,
               u.user_display_name,
               u.school_id,
               u.school_name,
               c.comment_content,
               c.comment_date
        FROM content_comments AS c
             INNER JOIN (
                  SELECT t.id,
                         t.user_display_name,
                         ts.id AS school_id,
                         ts.school_name
                  FROM users AS t
                       INNER JOIN schools AS ts ON t.user_school_id = ts.id
             ) AS u ON u.id = c.author_id
        WHERE c.content_id = ?
              AND c.content_category = ?
              AND c.comment_status = 1
        ORDER BY c.comment_date;",
        $postID, $categoryID
    );
    if (!$result) {
        // no comments
        $html = "0";
    } else {
        $html .= count($result);
    }
    return $html;
}

Ok, it works, but aside from the “inside the loop” problem, there are a lot of fields and tables in there that simply aren’t required to get the information we need.  While JOINs can be pretty quick, they’re not free by any means.  (Really, all that’s required is the comments table and the id/category, which we have as part of the original call anyway, so we didn’t need any joins, or even any columns other than our generated count.)
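
For the record, even if the call had stayed inside the loop, a count-only version would have needed nothing but the comments table itself; something along the lines of:

SELECT COUNT(*)
FROM content_comments
WHERE content_id = ?
      AND content_category = ?
      AND comment_status = 1;

No joins, no user or school columns, and no rows shipped back to PHP just to be counted.  It would still pay the per-iteration connection/plan/disk overhead, though, which is why the count moved into the main feed query instead.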

Let’s take a look at what the original “get the posts” request was, with the new functionality added for the per-post comment count handling:

        SELECT
            fi.id,
            fi.feeditem_category,
            fi.feeditem_school_id,
            fi.feeditem_entry_date,
            fi.feeditem_username,
            fi.feeditem_user_display_name,
            fi.feeditem_title,
            fi.feeditem_content,
            COUNT(cc.id) AS comment_count
        FROM content_feed_items AS fi
        LEFT JOIN content_comments AS cc
            ON cc.content_id = fi.id AND cc.content_category = fi.feeditem_category
        WHERE feeditem_status = 1
            $categoryClause
            $schoolClause
        GROUP BY
            fi.id,
            fi.feeditem_category,
            fi.feeditem_school_id,
            fi.feeditem_entry_date,
            fi.feeditem_username,
            fi.feeditem_user_display_name,
            fi.feeditem_title,
            fi.feeditem_content
        ORDER BY fi.feeditem_entry_date DESC
         LIMIT " . ((int)$startPage * (int)$pageSize) . ", " . (int)$pageSize;

So, above I’ve added the LEFT JOIN on the comments table along with the join columns, both of which happen to be indexed.  Furthermore, in order to return non-duplicates of the posts and get the aggregate (COUNT), I added the GROUP BY clause.

Now, one might look at this and think “omg that’s going to be so slow because now it needs to group all those records and fields together because we added an aggregate and it didn’t have to before and now the SQL statement is much more complex than it would have been just to count the comments separately from inside the loop!”   True, it looks more complex, and true, it does need to group everything now where it didn’t have to before, but for the most part that’s piddly stuff to a database engine: it’s still operating set based (so grouping is uber-fast), and it’s still just one call, one disk read, one connection and one execution plan.

As a perspective reminder: we were working with what’s essentially test data and 15-20 posts.  The full page generation time after my modification is somewhere between 1-2 seconds.  Before my modification it was somewhere between 6-8 seconds, and it would have gotten worse as more posts were returned (and it would have put a significant load on the server, which would have bottlenecked even further once actual users started making requests en masse).

And that’s why I made some minor adjustments to how the comment counts were being generated 🙂

I think that’ll give you some new light to see by while you’re making requests of databases.  I’m very much a db guy myself and don’t expect general application programmers (*especially* FE programmers) to have quite the depth of understanding that I do, but if you walk away from this with the general idea and a few key concepts, that can make a world of difference, and then all that typing was worth something!

Cheers,

(P.S.: I’m going to turn this into a blog post…)