Global-hopping Customer Data

Global-hopping Customer Data (Part 2)

Jan 25, 2010

Ramesh Menon

Continuing the discussion started in Globe-hopping Customer Data, I'd like to examine the issues related to identifying which customers you do NOT want to do business with.

One of our customers, a major bank, learned first-hand the complexities that arise from having to comply with watchlists from multiple countries. At a minimum, the bank had to screen customer data against its home government's watchlists (in Arabic script) and the US government watchlists (in Romanized/English form). To further complicate matters, the bank's customer data was stored in two separate databases: Saudi customer data in Arabic script, and all other customers in Romanized form.

Businesses have long understood the costs and missed revenue opportunities that come from mismatching customer data for customer service or marketing needs. But missing a watchlist match, when the penalty could be hundreds of millions of dollars or potential prosecution for executives, means an entirely different level of risk. The challenge grows when customer data and watchlists are in very different scripts and encodings.

Let's look at Arabic data in particular, and the challenges it can pose for computer systems. We've all heard about the different ways one can spell the very common name Mohammed. One commonly touted solution to this problem is transliteration: the process of converting a word or character from one writing system to another. While a good idea in theory, the messy reality of identity data and its use is such that this is not a perfect solution.

Let's take an Arabic name which transliterates to KADR. The Arabic characters that transliterate to K and F are close together on a keyboard, and a simple typo could mean that the original name is entered with the wrong first character. If we now transliterate the second name (with the typo), we end up with a very different Romanized version: FADR.

Sophisticated identity resolution algorithms can handle typos in the native script, just as English speakers would resolve a typo where an "M" has been typed instead of an "N". Transliteration, on the other hand, loses much of the context in the original encoding and leaves us with two very different Romanized words. This is why we recommend that our customers keep customer name data in the original encoding and base matching decisions on that original form using identity resolution, rather than relying solely on lossy conversions like transliteration.
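To make the point concrete, here is a minimal Python sketch. It is not the algorithm of any particular identity resolution product. The Arabic spellings, the one-to-one transliteration table, and the tiny keyboard-neighbor sets are all assumptions chosen for illustration; the function names `transliterate` and `typo_distance` are hypothetical. The idea is an edit distance that charges less for substituting a neighboring key, applied once to the native-script pair and once to the Romanized pair.

```python
# Assumed example data: an Arabic spelling that this toy table romanizes
# to KADR, and a variant whose first letter was typed on a neighboring key.
ORIGINAL = "قادر"
TYPO = "فادر"

# Toy one-to-one transliteration table (illustrative only, not a standard).
TRANSLIT = {"ق": "K", "ف": "F", "ا": "A", "د": "D", "ر": "R"}

# Pairs of keys that sit next to each other on the relevant keyboard.
# Only the pairs needed for this sketch are listed.
ARABIC_NEIGHBORS = {frozenset("قف")}
LATIN_NEIGHBORS = {frozenset("MN")}


def transliterate(name):
    """Romanize a name character by character using the toy table."""
    return "".join(TRANSLIT[ch] for ch in name)


def typo_distance(a, b, neighbors, near_cost=0.5):
    """Edit distance where swapping neighboring keys costs near_cost, not 1."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [float(i)]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                sub = 0.0
            elif frozenset((ca, cb)) in neighbors:
                sub = near_cost  # plausible slip of the finger
            else:
                sub = 1.0        # unrelated characters
            curr.append(min(prev[j] + 1.0,       # deletion
                            curr[j - 1] + 1.0,   # insertion
                            prev[j - 1] + sub))  # substitution or match
        prev = curr
    return prev[-1]


native = typo_distance(ORIGINAL, TYPO, ARABIC_NEIGHBORS)
romanized = typo_distance(transliterate(ORIGINAL), transliterate(TYPO),
                          LATIN_NEIGHBORS)
print(transliterate(ORIGINAL), transliterate(TYPO))
print("native script:", native, "romanized:", romanized)
```

In the native script the two spellings differ by one neighboring-key substitution, so the matcher can treat the difference as a likely typo. After transliteration, K and F are not neighbors on a Latin keyboard, so the same difference looks like a substitution of unrelated letters: the keyboard context has been lost in the conversion.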

Here's another reason why transliteration-based matching introduces more problems than it solves: There isn't a single standard way to transliterate!

Different transliterations are used by native speakers of different countries and languages.

The same Arabic name can be Romanized as:

NORTHERN AFRICA: Hadj Abdelhassane Ben Brahim

MIDDLE EAST: Haj Abbul Hacen Ibn Ibarahim

EAST AFRICA: Hag Abdul Hasen Ibrahem

ARABIAN PENINSULA: Haji Hassan Abdabrahim
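A quick way to see how far apart these Romanized forms are is to compare them pairwise. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a generic string similarity measure; that choice is an assumption for illustration, not the matching method recommended in this article.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The four regional Romanizations listed above.
VARIANTS = {
    "Northern Africa": "Hadj Abdelhassane Ben Brahim",
    "Middle East": "Haj Abbul Hacen Ibn Ibarahim",
    "East Africa": "Hag Abdul Hasen Ibrahem",
    "Arabian Peninsula": "Haji Hassan Abdabrahim",
}


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


pairs = list(combinations(VARIANTS.items(), 2))

# How many pairs an exact-match screen on the Romanized form would catch.
exact_matches = sum(1 for (_, a), (_, b) in pairs if a == b)

# Similarity score for every pair of regional variants.
scores = {(ra, rb): similarity(a, b) for (ra, a), (rb, b) in pairs}

print("exact matches:", exact_matches)
for (ra, rb), score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{ra} vs {rb}: {score:.2f}")
```

No pair matches exactly, and every pair needs fuzzy matching just to recover an identity that is a single string in the original script.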

Transliteration is a valuable tool for certain uses, especially when human beings need to work with data that was sourced in unfamiliar scripts. However, compliance screening carries very large risks when matches are missed, and data in its original encoding contains the richest elements for match decisions.

Hence we recommend that when the risks are high, our customers rely on the original sourced identity data for matching, and use hybrid identity resolution algorithms to overcome the variations, errors and character set considerations.


Practical International Data Management Online. A free resource from GRC Data Intelligence. For comments, questions or feedback: pidm@grcdi.nl