Merging Data Sources

About This Case Study

This case study discusses the processes which have been used at BIDS to produce databases suitable for a service environment. The organisation gained extensive experience in this area, both with bibliographic data and with statistical and taxonomic data.

Background

It frequently happens that a database needs to be constructed by taking data from other sources, converting and merging it to form a single database. There are numerous pitfalls in this process caused by errors and inconsistencies in the data. This document discusses where these can occur and suggests ways to handle them. It emphasises the importance of tackling some of the issues before they arise and discusses possible options. The overall approach is a pragmatic one - little is to be gained by trying to insist on perfection and while there is sometimes room for negotiation it is futile to expect too much The fact that the database isn't perfect does not mean that it isn't useful, sometimes extremely useful.

If the initial merger is successful then there will often be a requirement to add further unanticipated data. It discusses what can be done to mitigate the problems that can arise when this occurs.

Preliminaries

This paper attempts to provide some guidance on the problems likely to be encountered when trying to merge two or more databases. In order to avoid hedging every statement and opinion with caveats, I have frequently given the text a didactic tone which should not be taken too literally. There are no hard and fast rules, except perhaps to expect the worst and hopefully be pleasantly surprised when the worst doesn't happen.

If you are proposing to merge databases, this will almost inevitably involve moving one or more of them from a different system and this process brings its own set of tribulations which are discussed in the next section.

Just because it is difficult doesn't mean it isn't worth trying, simply that it is important to be realistic about the limitations of the process. The cumulative errors and inconsistencies between the databases will inevitably mean that the whole is less than the sum of its parts and differences in the world views of the compilers of the various databases will mean that there has to be some loss of resolution. The more databases you are trying to merge, the worse these effects become and the more difficult it is to anticipate the problems, so perhaps the first question you should be asking is whether you want to merge them at all or whether it is better to leave them as separate and identifiable databases, merging the data on the fly when the application requires it. Given that the application can usually be changed rather more easily than the database, this is an option which is always worth a look. It has its drawbacks however. Merging on the fly generally means that the application has to be able to deal with all the errors and inconsistencies as they appear. Moreover, it has to deal with them without necessarily having any prior knowledge of what they might be, so there is an ever present risk that the application can be thrown by some unforeseen eventuality or that the resulting virtual database may present serious errors or misconceptions to the end user as a result of the application mishandling them. At least if you have tried to put them together, you will have had the opportunity to study the data content and hopefully have a clearer view of what you are dealing with.

Moving a Database

Specification and Documentation

Even if these exist they will almost certainly contain errors and omissions which will cause problems. Of course, there is no suggestion that the supplier of the data has deliberately set out to confuse or mislead, quite the reverse, but with the best will in the world, the useful information never makes it from the head onto the page, so while it is useful as a guide, any specification or other documentation should always be treated as potentially incomplete.

Data Samples

Because the documentation should never be relied upon, the start of any serious study is the data itself. Always start by getting large samples of the data and then spend a couple of days (or more) studying them. Time spent doing this is very rarely wasted and you are likely to discover all sorts of interesting things. Each field should be checked through to make sure that it bears at least some relation to what you are expecting. The things to look out for are:

Variations in the formats of personal names
Fields which purport to contain one sort of information, but which have had extra information added in
Fields which have constrained values (e.g. dates) or which should be self checking (e.g. ISSNs & ISBNs) but which have not had their check digits recalculated.

These issues are discussed in more detail below.

If a database covers a substantial time period and has been compiled incrementally, then it is possible that the quality control has improved over time, so it is worth trying to obtain samples from different parts of the time period covered and in any case, just looking at recent data is to be avoided. The meaning of some of the fields have drifted over time, or the very early data may have a lot of errors. In extreme cases, the very early data may be so poor that you could consider discarding it, though there will often be political considerations which preclude this.

Some errors are almost impossible to spot by looking, so it is worth considering writing some scripts or short programs to check fields which you think may have errors. Check digits obviously fall into this category, as do dates. Fields which purport to contain a small range of possible values (e.g. language codes, currency codes) can easily be checked in this way and will often be found to have other, undocumented, values.

In other cases, you might be able to use a script, or a utility like the Unix grep command, to strip out the instances of a particular field from the sample records so that they can be viewed separately on the screen and making unexpected variants easier to identify.

The Problems of Major Variations in Approach

Databases are almost invariably created for some specific purpose (i.e. the data generally hasn't just been collected for the sake of collecting it) and this means that the data is skewed towards this purpose. It will also be skewed towards the creator's view of the world. Data is not a value-free commodity, but a description, often a highly personalised one of some aspect of the world. This description may be at odds with your requirements and the first important task in any data migration is to try to identify the assumptions which have been unconsciously built into the database and to assess the impact which these might have. (There are exceptions. Taxonomic databases, e.g. those covering the classification of plants or animals, have been created with no specific use in view, simply in order to record a section of the taxonomy in a modern, convenient form, but this is unusual.)

A database and its associated applications which have been in existence for some time will have evolved so that the applications make allowances for the vagaries of the data and it is important to bear this in mind.

What might look like egregious errors or inconsistencies to you might be unimportant for the original purpose. An example of this was an economic database containing sets of time-series data and supplied by a major UK institution. Close inspection showed that many of the time series were repeated and that some that were present were incomplete. In fact the data was supplied mostly to other financial institutions who took out the series they needed and discarded the rest. Since these institutions usually wanted the same subset (which was actually correct), errors in the rest were not noticed (and would probably have been considered unimportant). Trying to provide an academic database service was a different matter, since there was no way of knowing which series were going to be important for a particular line of research, so the whole database had to treated equally. This led to some interesting discussions with the supplier which were eventually resolved.

You can start by thinking about the way in which the data is currently being used. It is possible that its current use is to facilitate rapid short term decisions (commercial databases often have this purpose). In this case, timeliness is of the essence and there will often be an acceptance that a loss of accuracy is a price worth paying. Indeed, it may be compulsory since the data may be useless unless it is available in time. For research purposes however, timeliness is generally less important than accuracy, so what is required is a database of record, one which is a source of authoritative and reproducible results.

Another possibility is that the data is not of the correct scale. The author spent many months painfully acquiring a large quantity of meteorological data covering South West England and South Wales in the belief that this would be a broadly useful research resource for the academic community in the South West, only to discover that many people either wanted data about micro-climates (for which it was insufficiently detailed) or about the UK, the North Atlantic, Western Europe, etc (for which it was hopelessly localised) or about some specific location (which wasn't covered). An assumption that the research would follow the availability of resources was shown to be quite unfounded (at least in this case) and usage of the database was very low in spite of efforts to promote it.

Failing to recognise the implications of such strategic mismatches may result in a database which is not fit for the purpose and which, as a result, is little used.

Dealing with errors

Correction by the originators

The people who supplied the data (assuming they are still around) will usually want to be notified of errors that you find, so it is a good idea to negotiate some mechanism for notifying them and for receiving corrections before you start finding them. Keep it simple. People who maintain databases usually don't believe that there are very many errors in them and will happily agree to supply corrections. The arrangements will invariably need to be modified in the light of experience. Using the data for other purposes generally reveals a large number of problems and the suppliers will have underestimated the amount of work involved, so it is also safe to assume that any arrangement, however well-intended, may collapse when put to the test. For this reason it is important to have an alternative strategy for dealing with errors which does not depend on the supplier.

Impossible values

People who construct databases often incorporate very little error checking into the data definition itself even when this could be done automatically by the software. The result is fields in the data which are simply character strings and which can therefore contain pretty much anything.

The classic example is dates. If dates have hitherto been stored simply as character strings, then it is almost certain that impossible values (e.g. 29/2/1957) will have crept in. If you are planning to move the data to a DBMS field which is defined as a date it will certainly refuse to have anything to do with values like this. To make matters worse, the original supplier will probably have no way of recovering the correct value, so the situation is literally irretrievable. Similar problems arise with fields which are supposedly numeric, but which contain invalid characters.

Fixing it

It is generally not a good idea to take an overly rigorous approach to fixing errors and in any case people who use databases are surprisingly tolerant of them. Except in very unusual circumstances, you won't be able to work out the correction for yourself, so the options are basically:

Discard the data. This is always tempting, but is usually a bad idea because it will often have knock-on effects and because the error may be only in a single field in a record.
Leave it as it is, but mark it in some way (perhaps by having a field in your database which indicates that the record is erroneous). This is OK, but gets very complicated if you want to indicate anything more precise than that the record contains one or more errors somewhere.
Change it to some valid value which can be interpreted as meaning that there is an error. Risky, but useful if you have a value which simply cannot be stored in the database as it stands (e.g. dates which are invalid or non-numeric characters in a field defined as numeric). For this reason it is often not a good idea to impose constraints on the data which haven't been imposed before, as this may just cause insoluble problems.
Ignore it. This sounds like the worst of all worlds, but you need to be practical about these things and provided the value is obviously invalid, it may have surprisingly little impact. Most statistical packages, for example, incorporate techniques for dealing with missing values, so the item can simply be discarded during the analysis. This is not to say that errors are unimportant, or that you shouldn't make efforts to find and correct them, but these efforts can soon produce sharply diminishing returns. Errors which are obvious and detectable are actually much less problematic than those which are not algorithmically detectable, such as digit transpositions in numbers.

Semantic Distinctions

A database will sometime contain fields whose meaning has changed over time, or which has been used for more than one purpose and this is one situation where the data can be fixed without incurring penalties. This can occur in financial data which the meaning of certain terms in company accounts can be redefined. The result is that the data is not comparable from year to year, and since this is usually what academics who are using the data want to do, it can cause serious problems if it isn't highlighted in some way. The simplest solutions are either to split it into several fields (one for each meaning) all but one having a null value at any given occurrence, or to invent an extra field which indicates the changing meaning. For data which is to be processed solely by your applications either will do, but for data which is to be exposed to end users, the former is preferable as it makes the change of meaning explicit and difficult to ignore.

Merging

The most obvious thing to remember about merging databases is that there needs to be some basis for merging, i.e. that the databases need to have some common values which provide a link between them. Often this will be some universal unique identifier which has been assigned, e.g. an ISSN for journals or a Stock Exchange Daily Official List (SEDOL) number for quoted companies and their shares. Unfortunately universal unique identifiers are as susceptible to Murphy's Law as everything else. They may not be particularly universal, they may not even be unique and what they identify can be disconcertingly variable. As with all data it is important to make allowances for the unexpected and not to hard wire rigid constraints which will subsequently prove to be unsustainable.

Compatibility

Most of the problems encountered in merging databases arise as a result of trying to make fields which are on the face of it the same (e.g. a personal name) actually compatible. Before embarking on this, it is useful to think about what you are going to do with the data and to remember that filestore isn't usually a constraint, so rather than converting all the various forms of the data into a single reduced form, you could consider holding various copies of the same field. One of these could be (say) the name in its original form, another the name reduced to a common format suitable for searching and another a form suitable for displaying in the results of a search. An advantage of doing this is that you have the original in the database, so you could subsequently change to using that for displaying at some later date, and also because having the original format to hand in the database can be very useful for sorting out queries later. The disadvantage of it is of course that the conversion from the various formats becomes more complex. How you choose to play this trade-off depends on your circumstances.

Mark up

Databases containing text will often contain textual mark up to represent characters which are not normally encountered e.g. letters with accents & other diacritics, Greek or Cyrillic letters, and other signs like up-arrow which don't occur in the usual ASCII Latin alphabets. There will generally be (a) a unique system for representing these in each database and (b) they will all contain a different set of characters. This is one situation where standardising is almost certainly a good idea.

The characters fall into three categories:

Those which are a Latin letter with an added diacritic mark or a symbol representing a ligature. In these cases there is an obvious substitution, so although you might want to display the word "correctly" with the accented character or ligature in the search results, it will probably not be a good idea to insist on the inclusion of the accent for searching purposes, so people can search for "Cote", but see "CÔté" displayed. There are cultural arguments here of course. We live in a largely unaccented world and the accented world could argue that searching for "the" when you are looking for "thé" is not an acceptable way to treat their language. Whatever the rights and wrongs, this argument has pretty much been lost.
Those characters which cannot be converted to a basic Latin equivalent, e.g. Greek or Hebrew letters but for which there is still a need to provide a searchable version. In this case the simplest solution is to replace the Greek letter with the spelled out version e.g. chi-squared, gamma-rays, aleph-zero.
Those characters which have no sensible equivalent for searching purposes, e.g. up-arrow. These are easy to handle from a displaying point of view, but are very difficult to search for without using some decidedly non-obvious replacement text. Fortunately these characters are usually found in situations where searching is not commonly required so this tends not to be a problem.

In some database systems the indexing can be instructed to ignore character strings enclosed by certain delimiters and if this is available it provides a good solution. Alternatively, it may be possible to pre-process the field value before presenting it for indexing (which amounts to the same thing). It is necessary to define delimiters which will enclose the "hidden" text and which are either defined to the DBMS or used in the field pre-processor and also to indicate how many characters the enclosed text need to replace. Supposing that the delimiters are { and } the text can therefore look something like:

... a {pointer to chi.gif 3}chi-squared test on Du C{Ô 1}ot{é 1}e de Chez Swan."

for searching purposes the text reduces to:

".... a chi-squared test on Du Cote de Chez Swan."

Case Conversion

Although as a general rule it is not a good idea to adjust the data content, case conversion is sometimes an exception to this. Old databases will often have text fields which are in upper case and displaying this on the screen has come to look overly aggressive (though it used to be quite normal, of course). Depending on the content of the text, it may be possible to convert it to a more conventional mixed case. Care needs to be taken before embarking on this. The original ISI bibliographic databases provided to BIDS in 1991 were entirely in upper case and some consideration was given to attempting to convert it. The idea was discarded, mainly because of the near impossibility of distinguishing the word "in" from the chemical symbol for Indium ("In"), the symbol for Cobalt ("Co") from the formula for carbon monoxide ("CO"), and similar examples. It was decided that the benefits did not outweigh the potential for confusion which might occur, and that that BIDS could have been accused of corrupting or mishandling the data.

There are other situations however, when this has been done to good effect, usually when there is a restricted vocabulary. Journal names are an example where there is very limited punctuation and no awkward constructs like chemical and mathematical formulae. In this case, it is very easy to parse the string into words and adjust the casing of each word (perhaps putting a limited list of common words like "the" and ऺnd" into all lower case). It is not perfect of course. The simple algorithm converts acronyms like "NATO" into "Nato", so a list of common acronyms needs to be incorporated to stop this, but even allowing for deficiencies, the overall effect is distinctly preferable.

What to keep separate

It is not necessary to merge fields just because they contain notionally the same thing. In some circumstances there will be fine distinctions in the semantics of the fields which would make combining them seriously misleading.

A plant species, for example, may have several names associated with it, either because it has been reclassified (possibly several times) or because it has been named by different authorities on separate occasions. Complex rules have evolved for prioritising these and any manipulation of a taxonomic database will almost certainly need to take this into account.

Updates

All the above problems are compounded when the database needs to be updated. There are generally two possibilities here:

The database is being rebuilt from scratch using updated source data. If it is technically feasible this will usually be the preferable option. Not only does it allow you to correct errors which found their way into the original database (in spite of your efforts) but it also means that you can now make use of those insights which you had just too late the first time around. Unless the database is extremely large or there are other overriding reasons why it is infeasible (for example because a large number of manual fixes have been applied which would need to be reapplied), you should at least consider this option.
The database needs to be updated in situ. This can pose acute technical problems. Remember that in order to put the databases together, you needed some way to tie together related records. Even in the unlikely event that you've managed to reconcile all the relationships and have no records from one database that don't have partners in the other(s), there is no guarantee that this situation will continue and it is very unlikely that it will. Individual databases get updated at different rates, so any update process needs to assume that there will be unresolved relationships at any stage and that these may always be resolved by some subsequent update (or they may not, or the update may produce an additional erroneous resolution). How difficult it is to solve these problems depends on the regularity of the updates and how time critical they are. Basically you are trading effort against accuracy. If the updates arrive daily or weekly then it is unlikely that you'll be able to afford the luxury of manual intervention in the updating process and you will have to live with the results.

And Now We Have Another Database

Any merging operation which is even moderately successful runs the risk of being required to incorporate another unknown and unanticipated data source. It is usually futile to point out that one of the reasons the initial merger was a success is because all the databases were known about beforehand, and in any case, you are being paid the biggest compliment possible, so you might as well make the most of it.

Now however, is not the time to start thinking about it. It is always a good idea to behave from the start as though this were bound to happen and there are a number of fairly elementary things to bear in mind.

Fields which contain classification values, e.g. a language name or code should always allow for additional values.
Fields in the database should be disaggregated where possible, so if a supplied text field actually contains 2 or more (probably closely related) items of data and it is possible to disassemble these into separate fields, then do this. An example is journal names. There are numerous examples of journals (usually at the more popular end rather than hard-bitten academics) which have the same name and in this case it is normal to distinguish them by appending the place of publication (separated by some suitable delimiter). In this case it is usually a simple matter to split this into two fields.

About This Document

This case study was written by Clive Massey who worked on the original BIDS system and was subsequently involved in many aspects of the services, including database and User Interface design. He was later employed by Ingenta, initially as Help Desk Manager and then on database design and administration.