
Azure Series: 05 – Database Performance

Database Performance

Getting acceptable performance from client applications against a remote backend requires something of a paradigm shift compared to how we’re used to developing desktop applications in local client/server environments.  This latest post in the Azure Series covers the techniques and practices required to make that shift.

At 44 printed pages, this document’s a bit too large for a blog post, so you can find the PDF download below.

The Database Performance article is written for desktop developers who are new to working with remote backends and need some guidance on techniques for making the application perform well: a formidable task for the uninitiated.

The PDF can be downloaded here:

Below is a generalized table of contents that should help describe some of what you will find within the document.

  1. Foreword: why this was written, common concerns, and how we’ll attempt to address the problem.
  2. A Context and a Goal: This describes an example company that we’ll use throughout the article – a service company with some fairly demanding requirements.  We’ll also define the ultimate goal that we’re attempting to achieve in terms of performance.
  3. Getting Grounded: This discusses the basic concerns when working with remote databases, namely the ODBC driver and how it affects performance.  This section also gives us some insight into how we can recognize certain types of data and handle them accordingly, as there are different methods that may work better than others depending on what “type” of data we’re working with.
  4. List/Static Data: how to handle data that’s more or less “just there” and isn’t expected to change often.
  5. Primary Data: core recordsets, form binding, basic CRUD operations on non-complex data.
  6. Handling Complex Queries: the real fun… how to efficiently work with highly complex datasets for more sophisticated processes and views.  This section starts with a detailed look at the example company’s schema, defines an example query, discusses passthrough queries in some depth, covers how to manage updates to complex data, and finally looks at some other design considerations for working with “heavy” datasets.
  7. Tips, Tricks and Other Considerations: Here, topics on threading, timers and events, async processing, service queues, remote polling, denormalization, XML/JSON and stored procedure/bind parameters should give us some food for thought.

Feedback is appreciated!

Cheers,
-jack

Async Processing in Access

I’ve been wondering for a while how to go about running async tasks within Access.  The lack of multi-threading support has always been something I wished was available, but not until relatively recently, when dealing with Azure backend databases, has that missing feature been quite so painful.  While there are typically plenty of ways around it, it would certainly be helpful to have in many cases.

In any case, I recently came up with a way to implement some rudimentary async processing by using a separate utility Access application that is fully managed by the main front end.  Commands can be sent to the async processor from the main application, and when completed (or at other desired intervals/events) the async processor sends a notification back to the main application.

Naturally there are restrictions – the two aren’t “tightly tied” to each other and the only interface is through these command/notification methods (and any shared linked tables), but it’s enough to empower some pretty good functionality overall.  The restrictions aren’t much different from those of multi-threaded programming on other platforms, so it should be no biggie.

Overall, it’s a considerably more friendly setup than tapping into the .NET framework with COM interop and wired events.

Attached is a zip file with a demo.  Be sure both Main and Util applications are in trusted folders!  Otherwise you’ll get an error when trying to instantiate the util app that does the async processing.

Open the main app and open Form1 to see it in action.  This uses a 5 second interval timer for the async process and simulates a long running task with a 4 second Sleep command.

Here’s some basic (pertinent) code, but the demo has everything wired up nicely.

Main Application

  '(in some startup or initialization routine...)
  'create a new instance of an Access application
  'and open the UtilApp
  Set mUtilApp = New Access.Application
  mUtilApp.OpenCurrentDatabase CurrentProject.Path & "\Util.accdb"
  
  'run the utilApp's Initialize procedure, passing this application
  'instance as a parameter via Run()
  mUtilApp.Run "Initialize", Application

(the callback for the notifications – make sure it’s in a standard module and public so the util/async app has visibility to it)

Public Function EventHubCallback(args As Variant)
  
  'this will be called by the utility application as certain processes
  'are completed.  It is essentially the entry point for "events"
  'Bear in mind that these can fire at any unknown time and will block
  'the main UI thread, so keep them short and sweet.
  
  On Error Resume Next
  Forms("Form1").UpdateNotification
  
End Function

Util/Async Processing Application

Option Compare Database
Option Explicit
'(in a standard module so the Initialize function
' can be called from the main application)
Private mHost As Access.Application
Public Property Get Host() As Access.Application
  Set Host = mHost
End Property

Public Function Initialize(obj As Variant)
On Error GoTo Err_Proc
'=========================
  'this procedure is called from the main application
  'and the obj is the instance of that application.
  'We'll hold it here so we can refer to it later to
  'send notifications when certain events happen.
  Set mHost = obj
  
  'this form has our timer code that will check
  'the queue for things to process.  For the demo,
  'this just enters a Sleep cycle for 4 seconds to
  'simulate a long running process.
  DoCmd.OpenForm "Form1"
'=========================
Exit_Proc:
  Exit Function
Err_Proc:
  Err.Source = "App.Startup"
  Select Case Err.Number
    Case Else
      HandleError
  End Select
  Resume Exit_Proc
  Resume
End Function

(and the form with a timer to simplify the simulation of a long running task)

Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Private Sub Form_Timer()
On Error GoTo Err_Proc
'=========================
  'sleep for 4 out of the 5 seconds to simulate a blocking operation
  Sleep 4000
  Host.Run "EventHubCallback", "some random arg"
'=========================
Exit_Proc:
  Exit Sub
Err_Proc:
  Err.Source = "Form_Form1.Form_Timer"
  Select Case Err.Number
    Case Else
      HandleError
  End Select
  Resume Exit_Proc
  Resume
End Sub

Database Performance for Non-Databasey Devs

I had a junior dev (primarily front-end web) work on a site for me that included a couple calls to the underlying database.  Upon reviewing the work I decided to make a change to the way that some of the DB calls were handled.  Below is the explanation of why I made those changes…

So I’ve been bad at staying in touch (as usual) and was supposed to give you a rundown on why some changes to the DB reads for the comments make such a big difference in performance.  Rather than waiting for us both to find time to do something not really required, I figured I’d at least give you a rundown via email…

(note: these apply to pretty much all relational database systems: MySQL, SQL Server, PostgreSQL, SQLite, etc)

A good relational database engine is one of the most impressive components of any computer system I’ve ever seen.  The amount of work it goes through (potentially millions of rows of information that must be processed) and the fact that it can still get your answer back in a fraction of a second is phenomenal, in my book.

A few foundation points to remember when working with data:

  • Always think about where the data is coming from.  When you’re SELECTing from somewhere, think about what tables are involved and the scale of those tables.  Also consider what indexes are available, and whether the columns you’re filtering and/or sorting on are indexed.  Whatever’s in the WHERE, ORDER BY or JOIN/ON clauses is what the database engine uses to search for and define which actual records will be returned or joined; the columns in the SELECT list aren’t resolved until *after* the engine has identified the required rows, so the contents of the SELECT list are more or less irrelevant for performance considerations.  Make it a point to have a list of indexes handy for a given database if you’re going to be reading from it a lot and you don’t have someone managing the data access for you (see the index sketch just after this list).
  • Disk reads are extremely slow.  Memory (cached) reads are extremely fast.  Most databases do not store data in memory (it’s an option, but quite advanced and not relevant for this discussion).  Some databases store all tables in a single file (SQL Server, for example, uses two files: one for the data and one as a transaction log), whereas MySQL stores a separate file for each table (I think…).  Plan on every new SELECT statement having to perform a read from the disk.  That won’t always be the case, but it will be often enough to plan on it.  Disk reads can be something like 80% of the time the entire process takes: it’s significant.
  • Establishing connections is slow.  If the application/server setup doesn’t support connection pooling (i.e., keeping a cache of open connections to the database that can be reused as needed), then each and every call to the database needs a connection to be created.  There are various steps involved in this (some are bypassed based on certain cached connection mechanisms, depending on the platform/engine), but generally speaking the slow part is that the core DB engine needs to be found, then the authentication handshake needs to be performed (allowing the request into the database itself), and then the permission credentials of the requester need to be checked against the objects being queried once the request is “inside” the database.  Only then can the database handle the request and return the records.  (This description is very distilled and the process varies a lot between engines: MySQL is very rudimentary compared to SQL Server in this respect.  SQL Server has an excellent security model, but that also means complexity.)  In rough terms, let’s figure that if a disk read can consume 80% of the round-trip time, connection establishment and authentication can consume another 17% on top of that.
  • Multiple requests are slower than a single request.  If you can get the information in a single SELECT query, you’re typically better off than if you were to use two separate queries.  For each query, at runtime, the database engine will generate an execution plan based on the requirements, table sizes and tons of other factors; the point here is that there’s some overhead for each new query.  Avoid it where reasonable (most times you’ll just do what makes sense from an application programming point of view in this regard, but it’s worth mentioning anyway).
  • The real power of a database engine shines when it can use set-based processing (as opposed to procedural processing).  This tends to be a bit of a stumbling block for most, because almost all other programming (be it procedural or OOP, whatever) is *not* set-based.  Procedural processing in a DB is referred to as RBAR: Row By Agonizing Row.  It’s insanely slow compared to set-based processing.  I’ve used a crate of blueberries as an analogy before: if you have a crate of 1000 blueberries and you need to move it from one bench to another, set-based processing is like picking up the crate by the handles and moving it to the other bench; non-set-based is like picking up a single blueberry and putting it on the other bench, x1000.  I won’t go into too much detail here about set-based vs. not, but generally speaking, if you stick to basic commands/statements in plain SQL, you’re typically set-based.  Procedural processing tends to come in when running stored procedures and calling custom functions (see the set-based sketch just after this list).  A full dissertation on this topic is well beyond the scope of this email.
  • Think about the process required for each call to the database you make.  Will it have to read from disk?  Will it have to authenticate the request and resolve permissions?  Will it have to generate multiple execution plans to get all of the information being asked for?  Am I telling it to discern rows and join on key columns that have indexes (think library Dewey Decimal system), or will it have to look at Every. Single. Row.
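
To make the index point from the first bullet a bit more concrete, here’s a minimal sketch (all table and column names are made up purely for illustration) of the kind of indexes that support the columns a query actually searches and sorts on:

-- hypothetical table, for illustration only
CREATE TABLE orders (
    id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE,
    order_total DECIMAL(10,2)
);

-- the WHERE and ORDER BY columns are what the engine searches on,
-- so those are the ones worth indexing
CREATE INDEX ix_orders_customer_id ON orders (customer_id);
CREATE INDEX ix_orders_order_date ON orders (order_date);

-- this query can now locate its rows via the indexes; the remaining
-- columns in the SELECT list are only fetched for the rows that
-- survive the WHERE/ORDER BY
SELECT id, customer_id, order_date, order_total
FROM orders
WHERE customer_id = 42
ORDER BY order_date DESC;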
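
To make the set-based point concrete, here’s a minimal sketch (again with made-up table and column names, and a T-SQL-style cursor used purely for illustration) of the same work done both ways: the first statement moves the whole crate at once, the cursor version moves one blueberry at a time.

-- set-based: one statement, the engine handles all matching rows in a single pass
UPDATE order_lines
SET line_status = 2
WHERE order_id = 42;

-- RBAR equivalent: the same work done one row at a time,
-- paying the per-row overhead on every pass
DECLARE @line_id INT;
DECLARE line_cursor CURSOR FOR
    SELECT id FROM order_lines WHERE order_id = 42;
OPEN line_cursor;
FETCH NEXT FROM line_cursor INTO @line_id;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE order_lines SET line_status = 2 WHERE id = @line_id;
    FETCH NEXT FROM line_cursor INTO @line_id;
END;
CLOSE line_cursor;
DEALLOCATE line_cursor;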

Ok, so those are the basics.  It’s a lot to keep in mind, but more or less just remember that there is a specific set of things that happen: the engine doesn’t just magically return values when you ask for them.  By keeping in mind the general idea of the process and some of those key points, you’ve increased your performance awareness many times over.  Let’s look at some examples.

The task was to come up with a count of comments for each post from within PHP.  A list of the posts was already being generated and looped over in PHP so the posts could be echo’d back to the browser, meaning we’d already consulted the database at some point to get that information.

  • Get list of posts from database
  • Enter loop to generate markup for each post

Now, to get the comments, you created a call to the database from within the loop to count the number of comments:

  • Get list of posts from database
  • Enter loop to generate markup for each post
  • Make another call to the db to get the count of comments

Keeping in mind the list above, we can see that a number of key points have been violated: disk reads, connection establishment and query execution processing are running over and over again for each iteration of the original results.  That’s a TON of overhead, x50 (assuming the post batch returns 50 records per call), and the page doesn’t even start to display until *after* all of that is done.  (If it were ajax, it’d be a bit less of an impact on the UI end, though still a bit of unnecessary abuse against the database engine: imagine a couple thousand people visiting the site at once and pulling up the main feed, so take all that overhead x50 x1000…)

Essentially, by placing a separate database call inside the loop, you’ve created a procedural-style use of the database instead of a set-based one… for each record returned, the DB has to go find other records based on it, but it can’t use its internal processing prowess to do so because these calls are separated at the application level.

What do we really need to do?  For each post, count the number of comments.

SQL can do that.  So if we add it as part of the original call (from which we’re returning the posts anyway), we can get rid of that x50 (potentially x50 x1000) extra overhead, merge it with an existing call, minimize disk reads and pull it all through one request, all by adding one join to the comments table and one aggregate (COUNT).  MySQL handles this additional work in fractions of a second longer than the original call took, whereas before, each individual call from within the loop might have taken close to a second on its own, meaning that for 50 posts you’d spend a long time staring at the screen before the page showed up.  But now we’re able to get *all* of that information (individual posts as well as the count of comments for each post) in around 1 second or so.

Here’s a generic SQL statement for getting a count of a related table:

SELECT parent.ID, COUNT(child.ID) FROM parent INNER JOIN child ON parent.ID = child.ParentID GROUP BY parent.ID;

Note that only two tables are present: parent and child.  Further note that the join columns (the key ones used to discern how records will be further processed and ultimately returned) are ID columns that are likely to be indexed.  There’s no fluff: no unnecessary tables, no unneeded joins, no filtering/sorting/joining on non-indexed columns.
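
As a side note, for the generic statement above to stay fast as the child table grows, the column being joined on should be indexed.  Something along these lines (a sketch, assuming the parent’s ID column is already its primary key):

CREATE INDEX ix_child_parent_id ON child (ParentID);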

Let’s take a look at the call that was set up to run from each iteration of the post generation loop:

function getPostComments($postID, $categoryID) {
    // not used anymore, included the comment count as an aggregate of the main feeditem rather
    // than calling a new database read for each item
    $html = "";
    $db = dymeng\DatabaseFactory::GetDatabase();
    $result = $db->Select("
        SELECT c.id,
            c.author_id,
            u.user_display_name,
            u.school_id,
            u.school_name,
            c.comment_content,
            c.comment_date
        FROM content_comments AS c
            INNER JOIN (
                SELECT t.id,
                    t.user_display_name,
                    ts.id AS school_id,
                    ts.school_name
                FROM users AS t
                    INNER JOIN schools AS ts ON t.user_school_id = ts.id
            ) AS u ON u.id = c.author_id
        WHERE c.content_id = ?
            AND c.content_category = ?
            AND c.comment_status = 1
        ORDER BY c.comment_date;"
        , $postID, $categoryID
    );
    if (!$result) {
        // no comments
        $html = "0";
    } else {
        $html .= count($result);
    }
    return $html;
}

Ok, it works, but aside from the “inside the loop” problem, there are a lot of fields and tables in there that simply aren’t required to get the information needed.  While JOINs can be pretty quick, they’re not free by any means.  Really, all that’s required is the comments table and the id/category (which we have as part of the original call anyway), so we didn’t need any joins, or even any columns other than our generated count.
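
For what it’s worth, even if the count had stayed inside the loop, the query itself could have been trimmed down to something like the sketch below (based on the same columns used above).  It still wouldn’t be the right approach because of the per-iteration overhead, but it shows how little is actually needed to answer the question:

SELECT COUNT(*) AS comment_count
FROM content_comments
WHERE content_id = ?
  AND content_category = ?
  AND comment_status = 1;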

Let’s take a look at what the original “get the posts” request was, with the new functionality added for the post count handling:

        SELECT
            fi.id,
            fi.feeditem_category,
            fi.feeditem_school_id,
            fi.feeditem_entry_date,
            fi.feeditem_username,
            fi.feeditem_user_display_name,
            fi.feeditem_title,
            fi.feeditem_content,
            COUNT(cc.id) AS comment_count
        FROM content_feed_items AS fi
        LEFT JOIN content_comments AS cc
            ON cc.content_id = fi.id AND cc.content_category = fi.feeditem_category
        WHERE feeditem_status = 1
            $categoryClause
            $schoolClause
        GROUP BY
            fi.id,
            fi.feeditem_category,
            fi.feeditem_school_id,
            fi.feeditem_entry_date,
            fi.feeditem_username,
            fi.feeditem_user_display_name,
            fi.feeditem_title,
            fi.feeditem_content
        ORDER BY fi.feeditem_entry_date DESC
         LIMIT " . ((int)$startPage * (int)$pageSize) . ", " . (int)$pageSize;

So, above I’ve added the LEFT JOIN on the comments table along with the join columns, both of which happen to be indexed.  Furthermore, in order to return non-duplicates of the posts and get the aggregate (count), I added the GROUP BY clause.

Now, one might look at this and think “omg that’s going to be so slow because now it needs to group all those records and fields together because we added an aggregate and it didn’t have to before and now the SQL statement is much more complex than it would have been just to count the comments separately from inside the loop!”   True, it looks more complex, and true, it does need to group everything now where it didn’t have to before, but for the most part that’s piddly stuff to a database engine: it’s still operating set based (so grouping is uber-fast), and it’s still just one call, one disk read, one connection and one execution plan.
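
If that kind of worry ever nags at you, most engines will show you the plan they intend to use before you run anything.  In MySQL, for example, prefixing the query with EXPLAIN (sketched below against the same tables; the exact output columns vary by version) reports which indexes each table access is actually hitting:

EXPLAIN
SELECT fi.id, COUNT(cc.id) AS comment_count
FROM content_feed_items AS fi
LEFT JOIN content_comments AS cc
    ON cc.content_id = fi.id AND cc.content_category = fi.feeditem_category
WHERE fi.feeditem_status = 1
GROUP BY fi.id;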

As a perspective reminder: we were working with what’s essentially test data and 15-20 posts.  The full page generation time after my modification is somewhere between 1-2 seconds.  Before my modification it was somewhere between 6-8 seconds, and it would have gotten worse as more posts were returned (and it would have put a significant load on the server, which would have bottlenecked even more once actual users started making requests en masse).

And that’s why I made some minor adjustments to how the comment counts were being generated 🙂

I think that’ll shed some new light on what’s happening when you’re making requests of databases.  I’m very much a DB guy myself and don’t expect general application programmers (*especially* FE programmers) to have quite the depth of understanding that I do, but if you walk away from this with the general idea and a few key concepts, that can make a world of difference, and then all that typing was worth something!

Cheers,

(P.S.: I’m going to turn this into a blog post…)