
Integrating 3rd Party Services: What to Know Before you Go

Oftentimes when looking into software solutions, we may consider building a complete solution from the ground up versus making use of 3rd party services to cover the basic requirements, in conjunction with some possible custom integration(s) between those services.  All of the pros and cons between completely custom vs. integrated 3rd party solutions are beyond the scope of this article, but we’ll try to cover some of the primary considerations in terms of integrating various 3rd party components together.

Every Service Does It Differently

As we consider possible integrations, we should realize that every 3rd party service can not only grant whichever actions/integrations it may wish (or none at all), but can also provide those integration points completely differently than any other service.  In terms of developing and programming, this often means that in order to make use of these integrations, Service A requires development against one set of practices, X, while Service B requires a completely different set of practices, Y.  In the end, we often end up with three distinct components: one to handle X, one to handle Y, and a middleman Z to do whatever it is we actually needed to do.

Each Service will Require “Setup”

For each service that we might want to bring into the fold, there’s going to be a fairly significant “initial setup.”  While it’s not easy to put a number on how many hours might be involved, it’s worth consideration when trying to put together a system of services to provide an overall solution.

This is mainly due to the first point that every service does things differently.  In this case, it’s usually a matter of how the service requires authentication and authorization (loosely speaking: security) as well as some other things such as how much data we’re allowed to work with at any given time and what types of workarounds we might have to put in place.

As an example, it’s very common for a service to state “we allow you to work with orders, but you can only read or edit X number per command.”  This is done to ensure that a customer with 100,000 orders doesn’t issue some command that clogs up the service’s resources and prevents anyone else from getting anything done.  The downside here is that there’s a number of different policies that could be in place for how to work with larger volumes of data, and as such, each new service that we want to incorporate into the larger solution is likely to require special attention to this area as well.

In any case, it shouldn’t be considered a trivial matter to integrate 3rd party services, so as we’re jotting down things we’d like to do while putting our system together, this is a good thing to keep in mind.

The Integration We Need Might Not be Available

Generally speaking, whether or not we can do what we’d like to do is very dependent upon what types of integration the services in question allow.  Some services are very much designed with tight integration in mind, while others add integration services as an afterthought, or possibly not at all.

It’s a fairly common task, for example, to pull a list of contacts and their basic contact info from a CRM system and update some other system with that information, and such an integration task can certainly be very useful.  More complex integration scenarios – especially tasks that would use realtime, event-driven behaviors across 3rd party providers – are often much more difficult if not practically impossible, depending on the various services in question.

General Considerations

  • Try to keep the integrations minimal in terms of both the number of services to be integrated as well as the complexity of the integration.  Simple is more cost effective and also less prone to breaking changes as the 3rd party service inevitably makes their own changes.
  • For each service we’re considering integration for, inquire about their API (Application Programming Interface).  The API is what allows us to perform these integrations.  Some services may offer an expansive API while others offer none at all.  Many services will charge more for using an API, and often have pricing tiers based on the volume of API usage.
  • Ask their sales or technical department (preferably their tech department) if the general integration you’re looking for is supported.  For example, if you’d like to have a “goto customer” invoked in one service when a certain action happens in some other service, try to give a brief explanation of this desired behavior and see if it can be done (your development shop or IT department may have to work out the details here, but during the exploratory phase of planning your solution, it might not hurt to ask).

Sometimes things can pile up pretty quickly – be careful that the amount and complexity of the integrations required doesn’t overstep that of having a custom solution built.  As a culmination of the various points noted above, we can often find ourselves with quite the puzzle to work out while trying to strike the right balance.  By keeping things simple and restricting integrations to only those that are truly necessary, we’ll be better positioned for easier initial setup and long term stability of the overall solution.

The Smart Way to Handle your Data Generation Needs

I always hate having to generate test data… even with tools like Red Gate’s SQL Data Generator (which I use liberally when I need to do such tasks), there’s always a bunch of screwing around that needs to be done to get everything as tied up as neatly as possible, and in most cases you still can’t get actual, representative data.

Despite Troy Hunt’s less-than-satisfactory ode to Red Gate’s SQL Generator (sorry Troy, I really enjoy reading your stuff but I can only guess that you must have been paid well for this one), using a data generator – even a really good one like RG’s – still doesn’t get us all the way there.

In every database I’ve had to generate any sort of high volume test data for, I’ve never been able to get it quite right – sure, I can match it to schema requirements, that’s fine, but that doesn’t mean it resembles real world data or what the application expects to work with. I’d be hard pressed to come up with a definitive list of all the issues I’ve had with generated test data (every time, the list is as different as the projects themselves are different), but every time, there’s issues.

Recently we’ve been working on a sales/forecast analysis SaaS project for a client and have a whole new set of issues with generated data. To put the app requirements in scope, this will be a multi-tenant analysis application, so something as simple as customer sales and GP isn’t really all that easy to generate good data for (you’d think so, right? Pick a number range, pick a GPP range and let RedGate do its thing?)

The first issue is that we could have customers that work in thousands (let’s say a good set is between $1000 and $8000 per month), or other customers could be working in millions. RG (and most others) allows you to set a high/low limit for a numeric column, no problem, but we can’t very well have a customer with sales at $1234.00 for January and $12,345,678 for February (ok, maybe they’d like to, but let’s be real!). So what do we do, set up an RG project for each customer FK and choose a specific range for each FK? E.g., for CustomerID 456, generate between 4k and 8k ranged values, and for customer 789 generate between 1m and 2m values? Not really… RG is good at actually generating, but I don’t want to go through the project setup umpteen times to get a handful of customer sales dollars in there (besides which, I prefer my test data to not be tied so closely to Red Gate: I’d much prefer to dump a script out after and store that in source control for test data recreation rather than having everyone use RG and carry the RG Generator project around all the time).

The second issue – and this is one that pops up very often on projects – is with data distribution. This tends to be a recurring issue with dates in general, but specifically on this project, the application needs to be able to analyze distribution and trends in sales and forecast data. Sure, having relatively random data in there is good for seeing how the application logic responds to such things, but in the real world we do tend to see trends in data, and we tend to see them often. How do we deal with that? Another good example is with birthdates and families or dependents. Generating a random set of dates in a range is simple, but how do you get that set up so that spouse birthdates are generally close to their counterpart while child birthdates are considerably more recent?

To be fair, yes, I do realize that RedGate’s tool (among others, perhaps) allows for some custom expressions based on other columns, which is helpful, but still doesn’t quite get the job done (and again, this heavily ties the test datasets directly to RedGate’s project, which I much prefer not to do – and again, yes, I know the sqlgen file can be put into source control, but I don’t want to have to carry around RedGate licensing for every place this might need to be built).

Anyway, yes: data generation – even with best in class tools like RG’s data generator – still sucks. Generating accurate test data is still a major project and can be a significant cost due to the hours involved, unless your test data quality requirements happen to be very low.

So there are two main issues I need to deal with; two main goals to achieve. First, generate medium to high quality test data, and second, do so in a way that can be easily recreated (preferably checked into VCS and run as part of a build procedure… after all, sometimes we do need to develop against representative data, and adding rows in by hand is always a painful proposition).

The common approach, and certainly the one I started out with, is to more or less blindly open your generator tool, start setting up your configurations, and then go ahead and dump boatloads of data directly into the target tables. This is generally adequate for producing low quality (i.e., not real-world representative) data in mass amounts for stress testing, but even with careful planning you’re apt to run into some issues, and never mind the fact that it offers nothing in the way of dealing with any of the aforementioned problems.

The better approach, I’ve come to find, is to think not in terms of populating the final target tables with generated data directly, but instead to create an entirely separate set of tables for holding random values, which can be later applied to the final tables with much more integrity. By dumping the generated data into test tables, we can apply various metadata to those tables that further describes what the test table’s set represents.

Let’s take the case of birthdates and dependents. Rather than generating data directly to a Members or Persons or Patients table, let’s instead dump it into a birthdates table. As this will be a recurring theme, let’s go ahead and create a dedicated schema for these tables:
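A minimal sketch of what that might look like (the testdata schema and the table/column names are just placeholders):

    CREATE SCHEMA testdata;
    GO

    CREATE TABLE testdata.Birthdates (
        BirthdateID int IDENTITY(1,1) PRIMARY KEY,
        Category    varchar(20) NOT NULL,   -- which "generation" the date belongs to
        BirthDate   date        NOT NULL
    );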

Now we can generate sets of birthdates that are categorized, say, by generation. As an example, you might use RedGate to generate four sets of test data that’ll resemble something like this:
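Something along these lines, say (the category names and date ranges are made up; in practice you’d have the generator fill a few hundred rows per category):

    INSERT INTO testdata.Birthdates (Category, BirthDate) VALUES
        ('Senior', '1948-03-14'), ('Senior', '1951-11-02'),
        ('Boomer', '1958-06-21'), ('Boomer', '1964-09-30'),
        ('Adult',  '1979-02-08'), ('Adult',  '1988-07-17'),
        ('Child',  '2006-04-25'), ('Child',  '2011-12-05');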

Of course that’s not a script you’d actually run as-is, but it serves as an example well enough. Go ahead and generate a couple hundred random values into the table for each category, where the dates are ranged according to category. Then go to SSMS and generate scripts for the schema and data… save those to a file, then check that file into git, and there you have a repeatable, scriptable set of birthdates that actually represent what might be real birthdates roughly based on someone’s generation. Or, better yet, save that as part of your master toolset, as this export is absolutely a core artifact that can be reused on many projects.

Let’s do something similar with customer sales and gross profit percentages:
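Again just a sketch (table and column names are placeholders; VolumeCategory is what ties a row back to a rough sales range):

    CREATE TABLE testdata.CustomerSales (
        CustomerSaleID int IDENTITY(1,1) PRIMARY KEY,
        VolumeCategory varchar(10)   NOT NULL,   -- e.g., 'Low', 'Mid', 'High'
        SalesAmount    decimal(18,2) NOT NULL,
        GrossProfitPct decimal(5,2)  NOT NULL
    );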

Again, use your generation tool to come up with separate sets for each volume range. Maybe Low will be $1000-$4000 whereas High will be measured in millions.

Ok, so that seems like a good idea, we can create categorized data sets from our data generator into test tables, but how do we correlate that to the actual target tables?

We’ll need to build the target tables’ test data in stages. For example, first we might use the “out of the box” RG generator to come up with a table of Member names, and maybe we’ll just pre-fill the DateOfBirth field with an arbitrary date. Let’s assume that we have set up some sort of correlation for ParentID to indicate whether the person’s parent is listed (which would allow us to differentiate between a child/adolescent and adult). Once you have a sufficient dataset generated there, we can use a CROSS APPLY query to select random values from the test data rows, like so:
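Something like this (the Member table, its columns, and the child/adult split via ParentID are the assumptions described above):

    UPDATE t
    SET t.DateOfBirth = x.BirthDate
    FROM dbo.Member AS t
    CROSS APPLY (
        SELECT TOP 1 b.BirthDate
        FROM testdata.Birthdates AS b
        WHERE b.Category = CASE WHEN t.ParentID IS NULL THEN 'Adult' ELSE 'Child' END
          AND t.MemberID = t.MemberID     -- correlates the subquery to the outer row
        ORDER BY NEWID()                  -- random pick for each member
    ) AS x;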

Note the t.MemberID = t.MemberID in the cross-applied subquery there… without it, the update would select only one random value from the test set and apply that single value to all target rows; the correlation ensures that each target row gets its own fresh selection. The ORDER BY NEWID() is a simple means to sort randomly, which works well in conjunction with SELECT TOP 1.

This is an effective means to populate specific data values with randomly generated data within a constrained range, and can be used in multiple places:
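For example, the same pattern can fill customer sales rows from the categorized sets above (the target tables and the customer’s VolumeCategory column are, again, assumptions for illustration):

    UPDATE s
    SET s.SalesAmount    = x.SalesAmount,
        s.GrossProfitPct = x.GrossProfitPct
    FROM dbo.CustomerMonthlySales AS s
    INNER JOIN dbo.Customer AS c ON c.CustomerID = s.CustomerID
    CROSS APPLY (
        SELECT TOP 1 ts.SalesAmount, ts.GrossProfitPct
        FROM testdata.CustomerSales AS ts
        WHERE ts.VolumeCategory = c.VolumeCategory   -- keeps each customer within their own range
          AND s.CustomerMonthlySalesID = s.CustomerMonthlySalesID
        ORDER BY NEWID()
    ) AS x;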

This helps a lot – certainly much better than out of the box generation from the various tools available and is relatively easy to implement.

The above cases and examples don’t touch on more advanced scenarios such as sales distributions and/or data trends*, but by taking the same approach of generating categorized test datasets and then applying them to the target tables, we can come up with a much higher quality set of data with relatively little extra time involved. If we keep in mind that any test table can have any attributes defined as we like, and that we can play around with ordering and value selection in our application queries, repeatably generating medium to high quality test data remains well within reach!

Cheers,
-jack

*for trends, you’ll find that some slightly more sophisticated test sets will need to be generated, but even so – oftentimes there is a requirement for such, and generally speaking we don’t tend to walk away from such cases feeling like we’ve wasted time. For generating those more sophisticated test sets, consider also that you can use this same CROSS APPLY technique to fill more advanced test sets from more basic test sets as well!

Web Work with VBA

It seems like a good portion of the posts zinging my radar over at UA these days are people asking how to deal with a) web service calls in general, and b) JSON.  As usual, I get tired of writing the same replies for generic answers, and wind up writing a blog post.

So… here’s a general rundown on how to deal with Web Services (aka Web APIs, aka Services, aka APIs) from within VBA (along with some background theory, as most questioners don’t tend to have the underlying knowledge of how the web really works).

The Theory!!!

Requests & Responses, Headers & Content

Web service work is typically done over the HTTP protocol.  This consists of two primary components: the Request and the Response. The Request is what we send to the remote API, and the Response is what we get back.

A Request and a Response each have two primary components themselves: headers and the body (the body is often called the Content). The headers are not readily visible to non-developers, but the Body often is, in terms of regular webpages. Headers are key/value pairs. Each request or response usually has multiple headers, and they provide information such as the format of the data, authentication, originators and other meta-style information. Some header keys (and values) are standardized, though custom key/values can be added as well. Oftentimes, a service will require specific headers to be set.

The body can have content stuffed in it. This could be a simple data string, a bunch of XML, even encoded files (for uploading documents)… the body is basically just one big empty canvas where you can throw whatever.

When we make a Request, we specify an Endpoint (i.e., where it’s going: the URL of the web service), we supply the required headers and body content if applicable, and we send the request and receive a response.

Likewise, when we receive the response, it has however many headers that we may be interested in, and perhaps some body content, but most importantly it’s going to have an HTTP Status Code. The HTTP Status Codes are returned by the server indicating whether the request was handled properly. In general, a 200-range code means OK, a 300-range code means the request was redirected, a 400-range code means there was a problem with the request (404 “not found” being the most familiar), and a 500-range code means that there was some internal server error. In any case, google the list of HTTP Status Codes and familiarize yourself with a few of the common ones. (also note: just because the server returns a 200: OK doesn’t necessarily mean that your request was processed the way you intended… it might just mean that the server was able to receive the request completely, but the actual results of your request might be stuffed into a header or the body… it all depends on the service setup).
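To make the request/response round trip concrete, here’s a minimal VBA sketch using late-bound MSXML2 (the URL and header are placeholders):

    Sub BasicGetExample()
        Dim http As Object
        Set http = CreateObject("MSXML2.XMLHTTP.6.0")

        http.Open "GET", "https://example.com/api/v1/someresource", False   ' False = synchronous
        http.setRequestHeader "Accept", "application/json"                  ' an example request header
        http.Send

        Debug.Print http.Status          ' the HTTP status code, e.g. 200
        Debug.Print http.statusText      ' e.g. "OK"
        Debug.Print http.responseText    ' the response body/content
    End Sub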

RESTful vs SOAP

Ok… REST and SOAP. Don’t bother googling, it’s hard to actually find out what’s what if you’re not already in the know.

SOAP is older and is based on a (relatively complex) XML schema that forms the core of your request. This XML is typically submitted through the body of the request, and an XML-based SOAP response is returned, likewise a (relatively complex) XML schema stating the results of your request. SOAP is still used, but most people have gone the RESTful way in recent years.

REST, on the other hand, isn’t a specification (the way SOAP’s XML wrappers are) but more a method of doing things. As you’ll see in the next section, the HTTP protocol uses a small number of verbs as request methods (POST, GET, PUT, DELETE, etc.). REST basically says “let’s use the HTTP verbs instead of trying to stuff everything into XML all the time like SOAP does!” (ok, it comes with a number of other guidelines as well, more on the end of the web service developer, but this definition is fine for our purposes).

So, if you see a SOAP API, you’re likely to be working with some eyesore XML. If you see a REST API, you’re likely to be working with verbs, content/body and often JSON (which has also become very popular in recent years – more on that after the Verbs section).

Verbs

HTTP has a number of verbs that can be stated as the method for a request. The most common (and default) of these is GET, but there’s also POST, PUT and DELETE (as well as a few others that aren’t very relevant to us right now).

A GET request is primarily URL based. This means that aside from the base endpoint (www.example.com/api/v1/someresource, for example), you’re also going to be putting request parameters as part of the querystring at the end of the URL, so you might have: http://example.com/api/v1/someresource?min_id=12&max_id=24&order=desc

(a querystring is the ? at the end of a base url, with further parameters delimited by ampersands – google for more info)

Which parameters are acceptable in a GET request is up to the developer of the API.

A POST request, on the other hand, tends to stuff its data into the body of the request. You have a base endpoint (url) and maybe a querystring as well (but not typically), then you fill up the content/body of the request with whatever data you’re posting to the server.

POST requests are most often realized by submitting HTML forms. In the vast majority of cases, when you fill out some online form and hit Submit, the page is wrapping all that form data up into a POST request to the server. This makes sense, considering that there could potentially be LOTS of data (max amounts restricted by web server configuration), whereas a URL (for a GET request querystring) is somewhat limited in length (dependent upon the browser).

When working with web APIs, you’ll run into both GET and POST fairly often. It’s hard to lean towards “usually” in this case, but I guess I will anyway: usually (ehem) you’ll be using a GET request to request some information, such as “get me a list of orders”, and POST requests to submit information.
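And a similarly minimal sketch of a POST, with the data going into the body (the endpoint and JSON payload are made up for illustration):

    Sub BasicPostExample()
        Dim http As Object
        Dim body As String

        Set http = CreateObject("MSXML2.XMLHTTP.6.0")
        body = "{""customer_id"": 456, ""note"": ""created from VBA""}"

        http.Open "POST", "https://example.com/api/v1/someresource", False
        http.setRequestHeader "Content-Type", "application/json"   ' tells the server what's in the body
        http.Send body                                              ' the data travels in the request body

        Debug.Print http.Status, http.responseText
    End Sub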

DELETE is actually rarely used (for now, anyway). Usually a delete request would be done via GET or POST with authentication and a URL specifying, e.g.: http://www.example.com/api/v1/orders/delete/15 (to delete an order with ID 15… this is usually considered a “RESTful” URL, as it’s “Representative” of what we’re asking it to do…)

Web API servers can check the incoming request type and handle accordingly, so if you’re not using the required request method, your request may not be handled the way you need/expect. If it’s expecting/requiring a POST request to example.com/someendpoint and you submit a GET request to example.com/someendpoint, you might receive an error even though the URL is the same.

XML and JSON

Let’s talk data for a minute. There are three standard data transfer formats: XML, JSON and form-urlencoded (ok, and MIME transfers, which are for files). We really only need be concerned with the first two though. Form-urlencoded tends to deal with HTML forms, which we’re not likely to be using for web service work, and MIME types are… if you need MIME types you’re probably past this blog post anyway.

XML and JSON, on the other hand, are very relevant to our discussion. XML, of course, we’re all very familiar with. JSON is a very similar structured data format, but a bit more compact, and it also tends to be a bit harder to read. JSON’s popularity has grown enormously in the past five years or so, to the point where XML is often considered “old news” and irrelevant. This is actually not the case, yet our industry’s versions of hipsters and trendsters will be happy to tell you otherwise. Each has pros and cons, yet at the core they remain very similar and both are very capable. However, we as application developers cannot usually make a choice of one or the other: many newer web APIs support only JSON, meaning that we have to adopt it whether we like it or not (and hence all these posts asking about how, and hence this blog post…)

XML you’ll recognize. JSON, which stands for JavaScript Object Notation you might, or might not. It’s a format of basically key/value pairs, curly braces for nested entities. Here’s an idea:
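(a small made-up example, just to show the shape of it:)

    {
        "customer": {
            "id": 456,
            "name": "Acme Widgets",
            "orders": [
                { "order_id": 12, "total": 1234.00 },
                { "order_id": 13, "total": 987.50 }
            ]
        }
    }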

(now imagine that with all the whitespace and linebreaks removed… not very readable. Find yourself a good JSON prettifier)

In VBA, we have reasonably good support/tools for working with XML, but not so much for JSON. However, there’s a savior… we’ll touch on this later in the “practical” portion.

Web Services are NOT Created Equal

Perhaps one of the most frustrating/irritating/annoying things about working with various web APIs is that everyone tends to do it differently. Different authentication models, different uses of headers and content, non-standard uses of any of the previously discussed components, piss-poor documentation (spare me!), half-assed implementations… when working with any web API, you’re really at the mercy of how well the API is designed and documented. USPS (United States Postal Service for my foreign friends) has decent documentation and a fairly straightforward API – FedEx? not so much (sometimes I seriously suspect that documentation for some APIs is deliberately designed to fuck with your head just so you have to… call support I guess? I don’t know, but…)

A good, well designed, easy to use well documented web API is a blessing.

Luckily, the general acceptance of REST APIs seems to be somewhat standardizing things. If you try to google what REST is, your head will be spinning within minutes (few topics attract so much subjective opinionation on the web… design patterns for OOP are another, with so little of it being actually useful), but once you work with it a few times (often without even really knowing it) and have some experience with the other side, you’ll realize that it actually is pretty nice and relatively easy to follow (not unlike semantic naming conventions, which I believe I may have written about once before…)

Authentication

Authentication for Web APIs comes in a few flavors.  The most basic (and least used) is HTTP Authentication, which is the only one that’s actually part of the HTTP protocol and as such is implemented the same for any service that implements it.  This is done by setting a specific authorization header with a username/password string.  The only real trick to this is that the string must be encoded appropriately: a task we’ll touch on in the next portion.  Simple as it is, HTTP authentication is not very robust and does not tend to be used often.
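As a quick aside, here’s a rough VBA sketch of that encoding and header (it leans on MSXML’s bin.base64 data type; the credentials and URL are obviously placeholders):

    Function Base64Encode(ByVal plainText As String) As String
        ' uses MSXML's bin.base64 data type to encode a string - no external library needed
        Dim dom As Object, node As Object
        Set dom = CreateObject("MSXML2.DOMDocument.6.0")
        Set node = dom.createElement("b64")
        node.DataType = "bin.base64"
        node.nodeTypedValue = StrConv(plainText, vbFromUnicode)   ' string -> byte array
        Base64Encode = Replace(node.Text, vbLf, "")               ' strip any line breaks MSXML adds
    End Function

    Sub BasicAuthExample()
        Dim http As Object
        Set http = CreateObject("MSXML2.XMLHTTP.6.0")
        http.Open "GET", "https://example.com/api/v1/someresource", False
        ' HTTP (Basic) authentication: "Basic " followed by base64("username:password")
        http.setRequestHeader "Authorization", "Basic " & Base64Encode("myuser:mypassword")
        http.Send
        Debug.Print http.Status
    End Sub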

More commonly used is some sort of custom-implemented API key or sets of keys provided by a service owner to the developer or application.  The major shipping carrier UPS, for example, requires that the developer (or company creating the software) sign up for their developer platform, at which point the developer is given an API key.  The API key is then submitted with each request (usually as a request header or less often but sometimes through a querystring, or perhaps even in the XML or JSON data being submitted itself).

Management of such keys varies greatly… some service owners put time restrictions on a token, which may last only for 30 minutes (or 30 days) before it expires.  Obtaining a new key or token requires an initial “pre-request” of sorts, where you send a request to the server specifically asking for a new key or token so you can proceed with whatever you need to do.  These types of setups can make for some complex client logic (e.g., the VBA code), which must then assume that for every request made, we might get a bounce back requiring an updated authentication token.

OAuth

Yet another increasingly common method of API authentication is OAuth.  OAuth is a complex protocol that allows credentials from one person’s account elsewhere to be used to access a separate resource.  If you’ve ever signed into some “lesser” website using your Facebook or Google account, this is an example of OAuth.  The OAuth provider (Facebook or Google or whatever) states: yes, he’s who he says he is; here’s a key for him to carry for a bit as proof.

Facebook and Google aren’t often used for business-context APIs, but the same OAuth technique is often implemented with smaller providers.  Clio, for example (an industry-popular cloud based legal management software) requires OAuth authorization to use their API.  The setup assumes a standard user/UI account and a separate API portion.  To fire up the API, the developer registers their application with the UI account.  Then the person in charge of the UI account authorizes that application to use the API, in turn receiving what’s called a Bearer Token.  Subsequently, each API call to the Clio API must supply this Bearer token along with the request.  In this specific example of Clio, the bearer token expires only when explicitly revoked by the UI account manager (or in cases where a bill goes unpaid, presumably), but OAuth tokens in general can be set for any expiration time… it’s not uncommon for an OAuth token to be good only for 30 minutes or so before needing refreshment.

Bearer tokens with OAuth are only one means of many authentication models that OAuth supports, though they tend to be the most common, particularly in business-context APIs.

 

Rate Limiting & Throttling

The last particular point of note on consuming Web APIs is that there is next-door-to-always some sort of rate limiting and throttling imposed on the number of requests that can be placed (or resources requested) within a given time period.  We won’t go into too much detail, but suffice it to say that you’re bound to run into situations (more often than not) where your application will be allowed to make only X requests every 2 seconds, and only X orders can be obtained within a single request.  This keeps the server from overloading if a ridiculous (or more often, malicious) request is made, and you’ll very often have to work something into your application that accounts for it.  Failure to adhere to the rate limits results in what’s termed throttling, where the web service slows its response times, refuses the requests, or takes other actions.

Some APIs implement flat rate limits such as 100 requests per second or 1000 per hour or whatever the rule is.  Others use algorithms such as the “Leaky Bucket” (https://en.wikipedia.org/wiki/Leaky_bucket), which is actually fairly nice for consumers: it’s still based on only so many requests being processed at once, yet it allows for “bursts” of requests to be made, so long as the “burst” isn’t so heavy that it “fills the bucket.”

In any case, an application with any semblance of data volume will have to deal with this in the code making those requests, and timeout/wait/retry periods will typically have to be defined (this is a particular downfall in VBA as it has no async support: consider delegating the request processing to a secondary helper application (e.g.: Async Processing in Access) or perhaps a .NET COM Interop library (e.g.: .NET COM Interop Starter’s Guide)).

Pagination

A secondary consideration, and an often seen concept in terms of resource caps, is pagination.  This is the technique of a web service indicating that it will only supply X amount of some resource for any given request.  For example, if you request all orders since 2003 from your eCommerce API, there’s a very slim chance it’s going to send you back all of that information at once – it’d just be too much data to send in a single response, so instead it gets broken up into multiple request/response pairs (this also allows the server a chance to load balance the requests and apply rate limiting/throttling).

In these cases, the response typically comes back with some sort of “next ID” or “offset” number, which you then include in your next request.  For example, a GET request to your eCommerce store might look like this: http://api.mystore.com/orders?startDate=20030101

Let’s say that’s going to yield 1,000 orders, but the resource limit for this particular service is 100 orders per request.  In this case, you send the request and receive a response covering order indexes 0-99, along with a next offset of 100.  Subsequently, you make the exact same request again, but this time with an offset of 100 specified, so the response comes back with orders 100-199 and a next offset of 200.  Make another identical request except with the new offset of 200, and so on and so forth until all of your orders are received (these scenarios are where the leaky bucket algorithm of rate limiting really shines).
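To make that loop concrete, here’s a rough VBA sketch of the offset dance (the URL, parameter names, page size, and total count are assumptions – a real API tells you, somewhere in the response, what the next offset is):

    Sub PagedOrdersExample()
        Dim http As Object
        Dim offset As Long
        Dim morePages As Boolean

        Set http = CreateObject("MSXML2.XMLHTTP.6.0")
        offset = 0
        morePages = True

        Do While morePages
            http.Open "GET", "http://api.mystore.com/orders?startDate=20030101&offset=" & offset, False
            http.Send

            If http.Status <> 200 Then Exit Do   ' bail out (or wait and retry if we're being throttled)

            ' ...parse http.responseText and stash the orders somewhere...

            ' pretend each page holds 100 orders and we know there are 1,000 in total
            offset = offset + 100
            If offset >= 1000 Then morePages = False
        Loop
    End Sub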

… and the practice

(this is forthcoming, I have to pull out various examples of how to do all this stuff from various sources… check back in a week or so)

Keeping Secure Info from Source Control

(more self-reference stuff, notably in Visual Studio, but the process is similar regardless of platform – this is a varied collection of resources not meant to be used all at the same time)

Basic steps (not considering more advanced setups such as Azure deployment, etc):

  1. Set up app.config/web.config with non-secure info that will be checked in to VCS
  2. Create a separate config file for local developers, add the secure data, and add this file to .gitignore
  3. Create a separate test/prod config for the build server
  4. In VS, mark this as a “Content” file
  5. Add VCS repo readme notes for devs to create the local.config file(s) and add them to .gitignore
  6. Edit the .csproj of the composition root (i.e., the app) to copy the config source files in an AfterBuild target

Splitting app.config to separate files

(the standard app.config/web.config):
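A minimal sketch of what the checked-in config might look like (the key names are placeholders; the file attribute on appSettings is what pulls in the untracked local file):

    <?xml version="1.0" encoding="utf-8"?>
    <configuration>
      <!-- non-secret settings live here and get committed to VCS -->
      <appSettings file="local.config">
        <add key="AppName" value="MyApp" />
      </appSettings>
    </configuration>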

(the local.config):
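And a matching sketch of the untracked local file (its root element must be appSettings for the file-attribute merge to work; the values are obviously placeholders):

    <?xml version="1.0" encoding="utf-8"?>
    <appSettings>
      <!-- real secrets for this machine; this file is listed in .gitignore -->
      <add key="ApiKey" value="my-real-api-key" />
      <add key="DbPassword" value="my-real-password" />
    </appSettings>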

Note that this splitting technique is also particularly helpful in enterprise environments with multiple composition roots so all the connections/settings can be sourced from a centralized file.

Using AfterBuild target for blank templates and auto-merges

This is useful for deployment to Azure, where there are no local settings for production builds. The idea here is to have an empty template file to copy; then, in the Azure portal under App Settings, add the settings and they’ll be filled in accordingly:
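A rough sketch of what that AfterBuild target might look like in the .csproj (the template filename and copy destination are assumptions):

    <Target Name="AfterBuild">
      <!-- no local.config present (build server / Azure)? copy in a blank template -->
      <Copy SourceFiles="local.config.template"
            DestinationFiles="$(OutputPath)local.config"
            Condition="!Exists('local.config')" />
      <!-- otherwise carry the developer's local.config along with the build output -->
      <Copy SourceFiles="local.config"
            DestinationFiles="$(OutputPath)local.config"
            Condition="Exists('local.config')" />
    </Target>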

References

http://www.codeproject.com/Articles/602146/Keeping-Sensitive-Config-Settings-Secret-with-Azur
http://www.codeproject.com/Articles/8818/Using-the-File-attribute-of-the-appSettings-elemen
https://msdn.microsoft.com/en-us/library/ms228154(v=vs.100).aspx