XML, API, CSV, SOAP! Understanding the Alphabet Soup of Data Exchange
Let’s say you have two different software packages, and you’d like them to be able to share data. What would be involved? Can you link them so they exchange data automatically? And what do all those acronyms mean? Peter Campbell explains.
There has been a lot of talk lately about data integration, Application Programming Interfaces (APIs), and how important these are to non-profits. Much of this talk has focused on the major non-profit software packages from companies like Blackbaud, Salesforce.com, Convio, and Kintera. But what is it really about, and what does it mean to the typical org that has a donor database, a web site, and standard business applications for Finance, Human Resources and payroll? In this article, we’ll bypass all of the acronyms for a while and then put the most important ones into perspective.
Nonprofits have technology systems, and they live and die by their ability to manage the data in those systems to effectively serve their missions. Unfortunately, however, nonprofits have a history of adopting technology without a plan for how different applications will share data. This isn’t unique to the nonprofit sector: throughout the business world, data integration is often underappreciated.
Here’s simple example: Your mid-sized NPO has five fundraising staff people that together bring in $3,000,000 in donations every year. How much more would you bring in with six fundraising people? How much less with four? If you could tie your staffing cost data to hours worked and donations received, you would have a payroll-to-revenue metric that could inform critical staffing decisions. But if the payroll data is in an entirely different database from the revenue data, they can’t be easily compared.
Similarly, donations are often tracked in both a donor database and a financial system. If you’ve ever had to explain to the board why the two systems show different dollar amounts (perhaps because finance operates on a cash basis while fund development works on accrual), you can see the value in having systems that can reconcile these differences.
How can you solve these data integration challenges? Short of buying a system that tracks every piece of data you may ever need, data exchange is the only option. This process of communicating data from one system to another could be done by a straightforward manual method, like asking a staff member to export data from one system and import it into another. Alternatively, automatic data transfers can save on staff time and prevent trouble down the road - and they don’t have to be as complex as you might think.
What does it take to make a data exchange work? What is possible with your software applications? This article explains what you’ll need to consider.
Components of Data Exchange
Let’s get down to the nitty-gritty. You have two applications, and you’d like to integrate them to share data in some way: to pull data from one into another, or to exchange data in both directions. What has to happen? You’ll need four key components:
- An Initiating Action. Things don’t happen without a reason, particularly in the world of programming. Some kind of triggering action is needed to start the data interchange process. For an automatic data exchange, this is likely to be either a timed process such as a scheduler kicking off a program at 2AM every night, or a user action – for instance, a visitor clicking the Submit button on your website form.
- A Data Format. The data to be transferred needs to be stored and transferred in some kind of logical data format - for instance, a comma delineated text file – that both systems can understand.
- A Data Transfer Mechanism. If both applications reside on your own network, then a transfer is likely to be straightforward – perhaps you can just write a file to a location where another application can read it. But if one or both applications live offsite, you might need to develop a process that transfers the data over the internet.
Let’s look at each of these components in more detail.
An initiating action is what starts things rolling in the data exchange process. In most cases, it would take one of three forms:
- Human Kickoff. If you’re manually exporting and importing files, or need to run a process on a schedule that’s hard to determine in advance, regular old human intervention can start the process. An administrator might download a file, run a command line program, or click a button in an admin interface.
- Scheduler. Many data exchanges rely on a schedule - checking for new information every day, every hour, every five minutes, or some other period. These kinds of exchanges are initiated by a scheduler application. More complex applications might have a scheduling application built-in, or might integrate with Windows Scheduler or Unix/Linux Chron commands.
- End User Action. If you want two applications to be constantly in synch, you’ll need to try to catch updates as they happen. Typically, this is done by initiating a data exchange based on some end user action, such as a visitor clicking the Submit button on an online donation form.
In order to transfer data from one system to another, the systems need to have a common understanding of how the data will be formatted. In the old days, things were pretty simple: you could store data in fixed format text files, or as bits of information with standard delimiting characters, commonly called CSV for “Comma Separated Values”. Today, we have a more dynamic format called XML (eXtensible Markup Language).
An example fixed format file could be made up of three lines, each 24 characters long:
Name (20) Gender(1) Age(3)
Susan f 25
Mark m 37
A program receiving this data would have to be told the lengths and data types of each field, and programmed to receive data in that exact format.
CSV is easier to work with than fixed formats, because the receiving system doesn’t have to be as explicitly informed about the incoming data. CSV is almost universally supported by applications, but it poses challenges as well. What if your data has quotes and commas in it already? And as with fixed formats, the receiving system will still need to be programmed (or “mapped”) to know what type of data it’s receiving.
CSV is the de facto data format standard for one-time exports and data migration projects. However, automating CSV transfers requires additional programming – batch files or scripts that will work with a scheduling function. Newer standards, like XML, are web-based and work in browsers, allowing for a more dynamic relationship with the data sets and less external programming.
The XML format is known as a “self-describing” format, which makes it a bit harder to look at but far easier to work with. The information about the data, such as field names and types, is encoded with the data, so a receiving system that ‘speaks” XML can dynamically receive it. A simple XML file looks like this:
An XML friendly system can use the information file itself to dynamically map the data to its own database, making the process of getting a data set from one application to another far less laborious than with a CSV or fixed width file. XML is the de facto standard for transferring data over the internet.
Data Transfer Mechanisms
As we’ve talked about, an initiating action can spur an application to create a formatted. data set. However, getting that data set from one application to another requires some additional work.
If both of your applications are sitting on the same network, then this work is likely pretty minimal. One application’s export file can easily be seen and uploaded by another, or you might even be able to establish a database connection directly from one application to another. However, what if the applications are in different locations? Or if one or both are hosted by outside vendors? This is where things get interesting.
There are multiple ways to exchange data over the web. Many of them are specific to the type of web server (Apache vs. Microsoft’s IIS) or operating system (Unix vs Linux vs Microsoft) you’re using. However, two standards – called “web services” - have emerged as by far the most common methods for simple transfers: SOAP (Standard Object Access Protocol) and REST (Representational State Transfer).
Both SOAP and REST transfer data via the standard transfer protocol mechanism of the web: HTTP. To explain the difference between REST and SOAP, we’ll take a brief detour and look at HTTP itself.
HTTP is a very simple minded thing. It allows you to send data from one place to another and, optionally, receive data back. Most of it is done via the familiar Uniform Resource Identifier (URI) that is typed into the address bar of a web browser, or encoded in a link on a web page, with a format similar to:
There are two methods built into HTTP for exchanging data: GET and POST.
- GET exchanges data strictly through the parameters to the URL, which are always in “this equals that” pairs. It is a one-way communication method - once the information is sent to the receiving page, the web server doesn’t retain the parameter data or do anything else with it.
- POST stores the transferred information in a packet that is sent along with the URI – you don’t see the information attached to the URI in the address bar. Post values can be altered by the receiving page and returned. In almost any situation where you’re creating an account on a web page or doing a shopping transaction, POST is used.
The advantage to GET is that it’s very simple and easy to share. The advantages to POST are that it is more flexible and more secure. You can put a GET URI in any link, on or offline, while a POST transfer has to be initiated via an HTML Form.
However, add to the mix that Microsoft was one of the principal developers of the SOAP specification, and most Microsoft applications require that you use SOAP to transfer data. REST might be more appealing if you only need to do a simple data exchange, but if you’re working with Microsoft servers or applications, it is likely not an option.
Transformation and Validation Processes
While this article is focused on the mechanics of extracting and moving data, it’s important not to lose sight of the fact that data often needs a lot of work before it should be loaded into another system. Automated data exchange processes need to be designed with extreme care, as it’s quite possible to trash an entire application by corrupting data, introducing errors, or flooding the system with duplicates.
In order to get the data ready for upload, use transformation and validation processes. These processes could be kicked off either before or after the data transfer, or multiple processes could even take place at different points in time. An automated process could be written in almost any programming language, depending on the requirements of your target applications and your technical environment.
- Converting file formats. Often, one application will export a data file with a particular layout of columns and field names, while the destination application will demand another.
- Preventing duplicates. Before loading in a new record, it’s important to ensure that it doesn’t already exist in the destination application.
- Backup and logging. It’s likely a good idea to kickoff a backup of your destination database before importing the data, or at least to log what you’ve changed.
- User interface. For complex processes, it can be very useful to provide an administrative interface that allows someone to review what data will change and resolve errors prior to the import
- Additional Data Mining. If you’re writing a process that analyzes data, adding routines that flag unusual occurrences for review can be very useful. Or if you’re uploading donation data that also has to go to Finance, why not concurrently save that into a CSV file that Finance can import into their system? There are plenty of organizational efficiencies that can be folded into this process.
As described in the API section below, a sophisticated application may provide considerable functionality that will help in these processes.
Application Programming Interfaces (APIs)
What about APIs? How do they fit in? We’re hundreds of words into this article without even a mention of them – how can that be? Well, APIs are a fuzzy concept that might encompass all the aspects of data exchange we just discussed, or some of them, or none of them at all. Clear as mud, right?
An API is exactly what it says – an interface, or set of instructions, for interacting with an application via a programming language.
Originally, APIs were built so that third party developers could create integrating functions more easily. For instance, a phone system vendor might write specific functions into their operating system so that a programmer for a voice mail company could easily import, extract, and otherwise work with the phone system data. This would usually be written in the same programming logic as the operating system, and the assumption was that the third party programmer knew that language. Operating systems like Unix and Windows have long had APIs, allowing third parties to develop hardware drivers and business applications that use OS functions, such as Windows’ file/open dialog boxes.
APIs are written to support one or more programming languages – such as PHP or Java – and require a programmer skilled in one of these languages. An API is also likely to be geared around specific data format and transfer standards – for instance, it may only accept data in a particularly XML format, and only via a SOAP interface. In most cases, you’ll be limited to working with the supported standards for that API.
Choose Your Own Data Exchange Adventure
The type of data exchange that makes sense and how complex it will be varies widely. A number of factors come into play: the applications you would like to integrate, the available tools, the location of the data, and the platform (i.e Windows, Linux, web) you’re using. Integration methods vary widely. For instance:
- Striped all the way down to the basics, manual data exchange is always an option. In this case, an administrator (a Human Kickoff initiating action) might download a file into CSV, save it to the network, perform some manual transformations to put it into the appropriate file format, and upload it into a different system.
- For two applications on the same network, the process might not be too much more complex. In this case, a Scheduler initiating action might prompt one application to export a set of data as a CSV file and save it to a network drive. A transformation program might then manipulate the file and tell the destination application to upload the new data.
- Many web-based tools offer simple ways to extract data. For instance, to get your blog’s statistics from the popular tracking service FeedBurner, you could use a scheduled initiating action to simply request a FeedBurner page via HTTP, which would then provide you the statistics on a XML page. Your program could then parse and transform the data in order to load into your own reporting application or show it on your own website. Many public applications, such as GoogleMaps, offer similarly easy functionality to allow you to interact with them, leading to the popularity of Mashups- applications that pull data (generally via APIs) from two or more website.
- If you are using a website Content Management System which is separate from your main constituent management system, you may find yourself with two silos containing constituent data – members who enrolled on your web site and donors tracked in a donor database. In this circumstance, you might setup a process that kicks off whenever someone submits the Become a Member form. This process could write the data for the new member into an XML file, transfer that file to your server, and there kickoff a new process that import the new members while checking for duplicates.
Finding Data-Exchange-Friendly Software
As is likely clear by now, the methods you can use to exchange data depend enormously on the software packages that you chose. The average inclination when evaluating software is to look for the features that you require. That’s an important step in the process, but it’s only half of the evaluation. It’s also critical to determine how you can – or if you can – access the data. Buying into systems that overcomplicate or restrict this access will limit your ability to manage your business.
Repeat this mantra: I will not pay a vendor to lock me out of my own data. Sadly, this is what a lot of data management systems do, either by maintaining poor reporting and exporting interfaces; or by including license clauses that void the contract if you try to interact with your data in unapproved ways (including leaving the vendor).
To avoid lock-in and ensure the greatest amount of flexibility when looking to buy any new application – particularly the ones that store your data off-site and give you web-based access to it – ask the following questions:
- Can I do mass imports and updates on my data? If the vendor doesn’t allow you to add or update the system in bulk with data from other systems, or their warrantee prohibits mass updates, then you will have difficulty smoothly integrating data into this system.
- Can I take a report or export file; make a simple change to it, and save my changes? The majority of customized formats are small variations on the standard formats that come with a system. But it’s shocking how many web-based platforms don’t allow you to save your modifications.
- Can I create the complex data views that are useful to me? Most modern donor, client/case management and other databases are relational. They store data in separate tables. That’s good – it allows these systems to be powerful and dynamic. But it complicates the process of extracting data and creating customized reports. A donor’s name, address, and amount that they have donated might be stored in three different, but related tables. If that’s the case, and your reporting or export interface doesn’t allow you to report on multiple tables in one report, then you won’t be able to do a report that extracts names and addresses of all donors who contributed a certain amount or more. You don’t want to come up with a need for information and find that, although you’ve input all the data, you can’t get it out of the system in a useful fashion.
- Does the vendor provide a data dictionary? A data dictionary is a chart identifying exactly how the database is laid out. If you don’t have this, and you don’t have ways of mapping the database, you will again be very limited in reporting on and extracting data from the application.
- What data formats can I export data to? As discussed, there are a number of formats that data can be stored in. You want a variety of options for industry standard formats.
- Can I connect to the database itself? Particularly if the application is installed on your own local network, you might be access the database directly. The ability to establish an ODBC connection to the data, for instance, can provide a comparatively easy way to extract or update data. Consider, however, what will happen to your interface if the vendor upgrades the database structure.
- Can I initiate data exports without human intervention? Check to see if there are ways to schedule exports, using built-in scheduling features or by saving queries that can be run by the Windows Scheduler (or something similar). If you want to integrate data in real time, determine what user actions you can use to kick off a process. Don’t allow a vendor to lock you out of the database administrator functions for a system installed on your own network.
- Is there an API? APIs can save a lot of time if you’re building a complex data exchange. For some systems, it may be the only way to get data in or out without human intervention. Don’t assume any API is a good API, however – make sure it has the functions that will be useful to you.
- Is there a data exchange ecosystem? Are there consultants who have experience working with the software? Does the software support third party packages that specialize in extracting data from one system, transforming it, and loading it into another? Is there an active community developing add-ons and extensions to the application that might serve some of your needs?
Back to Reality
So, again, what does all of this really mean to a nonprofit organization? From a historical perspective, it means that despite the preponderance of acronyms and the lingering frustrations of some companies limiting their options, integration has gotten easier and better. If you picked up this article thinking that integrating and migrating data between applications and web sites is extremely complex, well, it isn’t, necessarily – it’s sometimes as simple as typing a line in your browser’s address bar. But it all depends on the complexity of the data that you’re working with, and the tools that your software application gives you to manage that data.
For More Information
An Introduction to Integrating Constituent Data: Three Basic Approaches
A higher level, less technical look at data integration options
The SOAP/XML-RPC/REST Saga
A blog article articulating the differences – from a more technical perspective - between REST and SOAP.
Mashup Tools for Consumers
New York Times article on the Mashup phenomenon
W3 Hardcore Data Standard Definition
W3, the standards body for the internet. The hardcore references for HTTP, XML, SOAP, REST and other things mentioned here.
Web API List
Techmagazine’s recent article linking to literally hundreds of applications that have popular Web APIs
Peter Campbell is currently the Director of Information Technology at Earthjustice, an non-profit law firm dedicated to defending the earth. Prior to joining Earthjustice, Peter spent seven years serving as IT Director at Goodwill Industries of San Francisco, San Mateo & Marin Counties, Inc. Peter has been managing technology for non-profits and law firms for over 20 years, and has a broad knowledge of systems, email and the web. In 2003, he won a "Top Technology Innovator" award from InfoWorld for developing a retail reporting system for Goodwill thrift. Peter's focus is on advancing communication, collaboration and efficiency through creative use of the web and other technology platforms. In addition to his work at SF Goodwill, Peter maintains a number of personal and non-profit web sites; blogs on NPTech tools and strategies at http://techcafeteria.com; is active in the non-profit community as member of NTEN; and spends as much quality time as possible with his wife, Linda, and eight year old son, Ethan.