Posted: March 22, 2022 Posted By: Abhinay Mehta
This article comes from a very specific problem that arose on a real-life project, and was addressed with success through the described methodology.
Imagine that you have a huge number of documents (in the millions) containing personal data about some people. You need to label these documents using crowdsourcing websites like Mechanical Turk.
Before you can send the millions of documents to strangers around the world, you need to make sure all personally identifiable information (PII) within the text documents is removed.
The list of data entities you need to consider is rather long: it includes names, home addresses, phone numbers, credit card numbers, IP addresses, etc. Going through 10 million documents manually would take far too long, so this requires an automated solution.
Trying to solve this problem on your own is rather difficult. Identifying names, locations, etc. in text is known in the world of AI and Natural Language Processing as Named Entity Recognition, and there are several open source tools and service providers that attempt to help you with it.
But if you are based in the UK (as we are) and need to find British-specific data in text, such as UK addresses, UK phone numbers, etc., you will find there aren’t many options to choose from. There is an online service for us UK folks called London Analytics that attempts to do exactly this.
London Analytics provides an online tool and an API to help find PII in text. After contacting London Analytics for an API Key, you can do the following.
Prerequisite:
> $ python -m pip install requests
Get a list of items the API could identify as potentially meaningful from some text:
This prints:
Now you can go through the list of data types you’re interested in. For us, this included PERSON, STREET_ADDRESS, POST_CODE, PHONE, EMAIL, IP, IPV6, CREDIT_CARD, and REFERENCE. We used this information to anonymize our text like this:
This prints:
Caveat: It should go without saying that no service is perfect, and of course PII will slip through the cracks here and there, so you need to do your own risk assessment on how best to use this service.
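To make the anonymization step concrete, here is a minimal sketch of how detected items could be swapped for placeholders. It is not the London Analytics API or its response format: the dictionary keys `type`, `start` and `end`, the `anonymize` function and the `<TYPE>` placeholder style are all assumptions made for illustration, so adapt them to whatever the service actually returns.

```python
from typing import Dict, List

# Data types to redact, taken from the list given in this article.
PII_TYPES = {
    "PERSON", "STREET_ADDRESS", "POST_CODE", "PHONE", "EMAIL",
    "IP", "IPV6", "CREDIT_CARD", "REFERENCE",
}


def anonymize(text: str, entities: List[Dict]) -> str:
    """Replace each detected PII span with a placeholder such as <PERSON>.

    ASSUMPTION: every entity is a dict with hypothetical keys
    'type', 'start' and 'end' (character offsets, end exclusive).
    """
    wanted = [e for e in entities if e["type"] in PII_TYPES]
    # Work from right to left so that earlier offsets stay valid.
    wanted.sort(key=lambda e: e["start"], reverse=True)
    cutoff = len(text)  # start of the last span that was replaced
    for e in wanted:
        if e["end"] > cutoff:
            continue  # overlaps a span that was already redacted
        text = text[: e["start"]] + "<" + e["type"] + ">" + text[e["end"]:]
        cutoff = e["start"]
    return text


# Hand-built example data, standing in for a real API response.
sample = "Call John Smith on 020 7946 0958."
name, phone = "John Smith", "020 7946 0958"
found = [
    {"type": "PERSON", "start": sample.index(name),
     "end": sample.index(name) + len(name)},
    {"type": "PHONE", "start": sample.index(phone),
     "end": sample.index(phone) + len(phone)},
]
print(anonymize(sample, found))  # Call <PERSON> on <PHONE>.
```

Replacing from the end of the string backwards means the offsets of items that have not been processed yet never shift, which keeps the function short. Keeping the data type in the placeholder also leaves the text readable for the people labelling it.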
Apart from helping us with anonymization, finding out whether documents contain PII has been useful for several other reasons: