I wish I was called Aaron Aardvark. When I search for “Richard Jones” on LinkedIn I get 2,685 hits! Names run in families. My father’s name was John Jones, my brother is John Jones, my grandfather was John Jones. I am not consistent with my name. My middle name is Lloyd. Is it part of my surname? My birth certificate says not. But people shuffle it there; if I cannot be found by a medical receptionist I say “try under L”. Search on LinkedIn for “Richard Lloyd Jones” and you still get 15 hits (not including me). I have another middle name, Patrick. It’s not on my birth certificate, so it’s not on my passport, but I am known as Richard P L Jones in some places. When I worked in Malaysia many years ago, I was fairly universally called Dr Richard, because family names come first in Asia. With the remains of a Welsh accent, over the phone I still often have to say ‘No!!! Not Johns, JONES- J O N …’

This sad little saga illustrates three problems with identifying a person by name. Names are not unique, they are not used consistently, and they are misspelt, whether by accident or with deliberate intent. But people have to be identified accurately, especially electronically, whether in the context of travel, financial or health services, law enforcement, etc., so clearly a name is not enough.

Some other dimensions associated with a person are needed to identify them uniquely, or at least more accurately. It is not coincidental that a passport contains the triplet: name, date of birth, place of birth. This is seen as a gold standard for identity verification. It does, however, rely on the verification of the three elements by a birth certificate, and it seems birth certificates are relatively easy to come by; the press has recently described the theft of identity, by both spies and undercover police, usually from infants who died at birth.

Identity verification via date and place of birth is not always possible, because that information is not reliably available or requested. Information that is often available is current address, email address and phone numbers. These are clearly not life-long values though, surprisingly, a mobile phone number is seen in fraud investigations as the most enduring parameter. Physical addresses suffer from some of the same problems as personal names. I always write Ct. as my abbreviation for Circuit in my address. I very recently found out that Ct. is the standard abbreviation for Court. Circuit is Cct.! Suburbs do change postcodes, and some people prefer a more salubrious town or suburb. My wife’s aunt in the UK always insisted she lived in Windsor, not Slough. Even dates are not safe: my mother’s year of birth fluctuated over a few years.
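To make the Ct./Cct. problem concrete, here is a minimal sketch in Python of the kind of address clean-up that usually happens before two addresses are compared. The abbreviation table and the function name `normalise_address` are my own illustrative assumptions, not an official postal standard or any particular product's method.

```python
import re

# Hypothetical abbreviation table for illustration only; a real system
# would use the postal authority's full list of street types.
STREET_ABBREVIATIONS = {
    "ct": "court",
    "cct": "circuit",
    "rd": "road",
    "ave": "avenue",
}


def normalise_address(address: str) -> str:
    """Lower-case an address, drop punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(STREET_ABBREVIATIONS.get(token, token) for token in tokens)


# The address as I would write it, and as it should be written.
as_written = normalise_address("12 Example Ct.")
as_intended = normalise_address("12 Example Cct.")
print(as_written)
print(as_intended)
```

Note that the clean-up does exactly what the standard says and turns my Ct. into Court, so the two versions of my address still disagree. Normalisation helps with inconsistent formatting, but it cannot repair a wrong abbreviation; that is left to the fuzzy comparison that follows.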

So if it is necessary to compare two records to determine how likely it is that they refer to the same person, then the differences between corresponding fields (name, address, age, etc.) need to be individually computed and then combined in some way. The relative frequency of field values (e.g. Aardvark vs. Jones) must be taken into account, and thresholds defined to decide whether a pair is a match, is not a match, or might be a match. A complication is that records from different sources will often not have the same fields to compare. For an excellent in-depth and very technical review of current methods, see a recent book by Peter Christen, a colleague of mine at Australian National University [1].
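The steps in the previous paragraph can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not one of the methods from Christen's book and not what any commercial service does. The surname frequencies, the default frequency, the two thresholds and all function names are made up for the example, and the string comparison uses `difflib.SequenceMatcher` from the standard library rather than a purpose-built name comparator.

```python
import math
from difflib import SequenceMatcher

# Hypothetical relative frequencies of surnames in a reference population.
SURNAME_FREQ = {"jones": 0.01, "aardvark": 0.000001}
DEFAULT_FREQ = 0.0001  # assumed frequency for surnames not in the table


def similarity(a: str, b: str) -> float:
    """Approximate string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def rarity_weight(surname: str) -> float:
    """Rare surnames carry more weight than common ones."""
    freq = SURNAME_FREQ.get(surname.strip().lower(), DEFAULT_FREQ)
    return -math.log10(freq)


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of field similarities over the fields both records have."""
    total = 0.0
    weight_sum = 0.0
    for field in rec_a.keys() & rec_b.keys():
        a, b = rec_a[field], rec_b[field]
        if not a or not b:
            continue  # a field missing on either side tells us nothing
        weight = rarity_weight(a) if field == "surname" else 1.0
        total += weight * similarity(a, b)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def classify(score: float, lower: float = 0.6, upper: float = 0.85) -> str:
    """Two thresholds give three outcomes: match, possible match, non-match."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "possible match"
    return "non-match"


record_1 = {"surname": "Jones", "given": "Richard", "phone": "0400 000 000"}
record_2 = {"surname": "Johns", "given": "Richard", "email": "rj@example.com"}
score = match_score(record_1, record_2)
print(round(score, 2), classify(score))
```

Only the surname and given name are compared here, because the phone number and the email address each appear in just one of the two records. Pairs that land between the two thresholds are the ones that would normally go to a person for review.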

Given the importance of the problem, not surprisingly there are many commercial offerings and services for different contexts. These contexts include deduplication, popular when two companies merge and need to combine their customer lists; fraud and felony detection, where people are deliberately trying to hide their identities; and identity matching in areas such as health care, where it is important not to open a new record about a person who is already known to the system, but where in general people are not trying to beat the system.

Encompass Corporation are partnering with GDC (Global Data Company)[2] to offer instant verification of persons and organisations. For a fee, GDC claim to match against 3 billion records drawn from 30 countries. They will undoubtedly use some of the advanced techniques described by Christen.

So identity verification is definitely not for beginners. I am doing research for Encompass on a novel way to do name matching, one component of the problem that has interested me for some time. It is looking quite promising; more about that another day. Oh, and I am still wondering if Aaron Aardvark is an improvement as a name or not!


[1] Peter Christen, “Data Matching” (2012), Springer, ISBN 978-3-642-31163-5

