
Redacting custom strings

In this tutorial, we will use the Cape API to perform custom redactions on sensitive text and files. In addition to redacting PII, PCI and PHI, the Cape API lets you de-identify other sensitive data using custom regular expression patterns. We will cover two scenarios: custom redactions for text and custom redactions for files.

Get a Cape API Key

This tutorial assumes you have an environment variable CAPE_API_KEY that contains an API key. You will need to sign up for a Cape account, and then you can get an API key here.
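The examples in this tutorial read the key with os.getenv, so a missing variable would silently send "Bearer None" and the request would fail. A minimal sketch of an early check follows; the helper name get_api_key is hypothetical and not part of the Cape API.

```python
import os


def get_api_key(env=None, var_name="CAPE_API_KEY"):
    """Return the API key from the environment, or raise if it is missing.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running the examples.")
    return key
```

With this helper, the Authorization header could be built as f"Bearer {get_api_key()}" so that a missing key fails before any request is sent.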

Custom redactions for text

In this scenario, we will perform custom redactions on sensitive text using the Cape API.

import os
import requests

def deidentify_text_with_custom_redaction(content, entity_detection):
    url = "https://api.capeprivacy.com/v1/privacy/deidentify/text"
    headers = {
        "Authorization": f"Bearer {os.getenv('CAPE_API_KEY')}"
    }
    payload = {
        "content": content,
        "entity_detection": entity_detection
    }

    response = requests.post(url, headers=headers, json=payload)

    response_data = response.json()

    if response.status_code == 200:
        return response_data['content'], response_data['entities']
    else:
        return None, response_data.get('detail', 'Unknown error')

content = "Bob(DEF289) is Claire's(PLA597) friend who lives in Eutopia! They have been friends since Bob was a toddler."
entity_detection = {
    "entity_types": [],
    "filter": [
        {
            "type": "BLOCK",
            "pattern": "Bob",
            "entity_type": "DEFENDANT"
        },
        {
            "type": "BLOCK",
            "pattern": "[A-Za-z]{3}\\d{3}",
            "entity_type": "ID"
        },
        {
            "type": "BLOCK",
            "pattern": "Eutopia",
            "entity_type": "LOCATION"
        },
        {
            "type": "ALLOW",
            "pattern": "toddler"
        }
    ]
}
deidentified_text, entities = deidentify_text_with_custom_redaction(content, entity_detection)
if deidentified_text is not None:
    print("Deidentified Text:", deidentified_text)
    print("Entities:", entities)
else:
    print("Error:", entities)

Output
{
    "content": "[DEFENDANT_1]([ID_1]) is [NAME_GIVEN_1]'s([ID_2]) friend who lives in [LOCATION_1]! They have been friends since [DEFENDANT_1] was a toddler.",
    "entities": [
        {
            "processed_text": "DEFENDANT_1",
            "text": "Bob",
            "best_label": "DEFENDANT"
        },
        {
            "processed_text": "ID_1",
            "text": "DEF289",
            "best_label": "ID"
        },
        {
            "processed_text": "NAME_GIVEN_1",
            "text": "Claire",
            "best_label": "NAME_GIVEN"
        },
        {
            "processed_text": "ID_2",
            "text": "PLA597",
            "best_label": "ID"
        },
        {
            "processed_text": "LOCATION_1",
            "text": "Eutopia",
            "best_label": "LOCATION"
        }
    ]
}
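Each entry in entities pairs a placeholder (processed_text) with the original text it replaced, so the response can be turned into a lookup table. The sketch below assumes the response has exactly the shape shown in the sample output above; the function names placeholder_map and restore are hypothetical helpers, not part of the Cape API.

```python
def placeholder_map(entities):
    """Map each bracketed placeholder, e.g. "[ID_1]", to the original text."""
    return {f"[{e['processed_text']}]": e["text"] for e in entities}


def restore(deidentified_text, entities):
    """Substitute the original values back into the de-identified text."""
    restored = deidentified_text
    for placeholder, original in placeholder_map(entities).items():
        restored = restored.replace(placeholder, original)
    return restored
```

Keeping such a mapping alongside the redacted text is only appropriate where the original values may be stored; it reverses the redaction.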

Custom redactions for files

In this scenario, we will perform custom redactions on a sensitive file using the Cape API. Assume you have a file named my_file.txt, in the same directory as your script, with the following content.

Bob(DEF289) is Claire's(PLA597) friend who lives in Eutopia! 
They have been friends since Bob was a toddler.
Their bond has weathered the test of time....

import os
import requests
import json

def deidentify_file_with_custom_redaction(file_path, entity_detection):
    url = "https://api.capeprivacy.com/v1/privacy/deidentify/file"
    headers = {
        "Authorization": f"Bearer {os.getenv('CAPE_API_KEY')}"
    }

    # Keep the file open for the duration of the upload.
    with open(file_path, 'rb') as file:
        files = {"file": (file_path, file)}

        # Send entity_detection as a JSON-encoded form field alongside the file.
        data = {"entity_detection": json.dumps(entity_detection)}

        response = requests.post(url, headers=headers, files=files, data=data)

    response_data = response.json()

    if response.status_code == 200:
        return response_data['content'], response_data['entities']
    else:
        return None, response_data.get('detail', 'Unknown error')

file_path = 'my_file.txt'
entity_detection = {
    "entity_types": [],
    "filter": [
        {
            "type": "BLOCK",
            "pattern": "Bob",
            "entity_type": "DEFENDANT"
        },
        {
            "type": "BLOCK",
            "pattern": "[A-Za-z]{3}\\d{3}",
            "entity_type": "ID"
        },
        {
            "type": "BLOCK",
            "pattern": "Eutopia",
            "entity_type": "LOCATION"
        },
        {
            "type": "ALLOW",
            "pattern": "toddler"
        }
    ]
}

deidentified_text, entities = deidentify_file_with_custom_redaction(file_path, entity_detection)
if deidentified_text is not None:
    print("Deidentified Text:", deidentified_text)
    print("Entities:", entities)
else:
    print("Error:", entities)

Output
{
    "content": "[DEFENDANT_1]([ID_1]) is [NAME_GIVEN_1]'s([ID_2]) friend who lives in [LOCATION_1]! \nThey have been friends since [DEFENDANT_1] was a toddler.\nTheir bond has weathered the test of time....",
    "entities": [
        {
            "processed_text": "DEFENDANT_1",
            "text": "Bob",
            "best_label": "DEFENDANT"
        },
        {
            "processed_text": "ID_1",
            "text": "DEF289",
            "best_label": "ID"
        },
        {
            "processed_text": "NAME_GIVEN_1",
            "text": "Claire",
            "best_label": "NAME_GIVEN"
        },
        {
            "processed_text": "ID_2",
            "text": "PLA597",
            "best_label": "ID"
        },
        {
            "processed_text": "LOCATION_1",
            "text": "Eutopia",
            "best_label": "LOCATION"
        }
    ]
}

In the response, Bob has been redacted as DEFENDANT instead of NAME, and Eutopia as LOCATION, according to the custom patterns. The pattern [A-Za-z]{3}\d{3} matches strings such as "DEF289" and "PLA597", which are classified as ID and redacted as well. Claire is not covered by any custom pattern and is still redacted as NAME_GIVEN. The word "toddler", which would otherwise have been redacted as AGE, remains visible because of the ALLOW filter. The entities array describes each processed entity: its processed text (the placeholder), its original text, and its best label.
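Before sending a request, it can help to preview locally which substrings the BLOCK patterns would catch. The sketch below uses Python's re module on the sample text; it is only an approximation, since it assumes the API applies patterns the way re.finditer does, and the helper name preview_block_filters is hypothetical.

```python
import re


def preview_block_filters(content, filters):
    """Return {entity_type: [matched substrings]} for every BLOCK filter.

    ALLOW filters are skipped because they carry no entity_type.
    """
    preview = {}
    for f in filters:
        if f.get("type") != "BLOCK":
            continue
        matches = [m.group(0) for m in re.finditer(f["pattern"], content)]
        preview.setdefault(f["entity_type"], []).extend(matches)
    return preview


sample = "Bob(DEF289) is Claire's(PLA597) friend who lives in Eutopia! They have been friends since Bob was a toddler."
sample_filters = [
    {"type": "BLOCK", "pattern": "Bob", "entity_type": "DEFENDANT"},
    {"type": "BLOCK", "pattern": "[A-Za-z]{3}\\d{3}", "entity_type": "ID"},
    {"type": "BLOCK", "pattern": "Eutopia", "entity_type": "LOCATION"},
    {"type": "ALLOW", "pattern": "toddler"},
]
print(preview_block_filters(sample, sample_filters))
# {'DEFENDANT': ['Bob', 'Bob'], 'ID': ['DEF289', 'PLA597'], 'LOCATION': ['Eutopia']}
```

The local preview lists both occurrences of Bob, which corresponds to the single placeholder DEFENDANT_1 being reused in the API output.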

The pattern parameter can be a word or a regex pattern. Refer to the Python regular expression operations documentation for guidance on crafting regex patterns.
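Because patterns follow Python regex syntax, a malformed pattern can be caught locally before it is sent. The following is a minimal sketch using re.compile; the helper name invalid_patterns is hypothetical, and how the API itself reports a bad pattern is not covered here.

```python
import re


def invalid_patterns(filters):
    """Return (pattern, error message) for each filter whose pattern fails to compile."""
    problems = []
    for f in filters:
        try:
            re.compile(f["pattern"])
        except re.error as exc:
            problems.append((f["pattern"], str(exc)))
    return problems


def matches_whole_word(word, text):
    """True if `word` occurs in `text` delimited by word boundaries (\\b)."""
    return re.search(rf"\b{re.escape(word)}\b", text) is not None
```

A plain pattern such as Bob also matches inside longer words like "Bobby" under Python's rules; wrapping it in \b anchors, as matches_whole_word does, restricts it to the standalone word.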

Tip

You can also use the filter parameter for exclusive redactions. For example, to redact only Bob as NAME and no other name, enable every entity type you want redacted in entity_types (excluding NAME) and add a BLOCK filter that detects Bob as NAME. Check out the entity_types schema for the full list of supported entity types.
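The tip above can be sketched as a small payload builder. The format of the entries inside entity_types is defined by the entity_types schema and is not shown in this tutorial, so the sketch treats them as an opaque list supplied by the caller; the function name exclusive_detection is hypothetical.

```python
def exclusive_detection(entity_types, pattern, entity_type):
    """Build an entity_detection payload with one BLOCK filter.

    `entity_types` is passed through unchanged; build it according to the
    entity_types schema, leaving out the type named in `entity_type`.
    """
    return {
        "entity_types": list(entity_types),
        "filter": [
            {"type": "BLOCK", "pattern": pattern, "entity_type": entity_type}
        ],
    }


# Example: redact only "Bob" as NAME. The empty list is a stand-in for the
# schema-specific selection of the other entity types you want enabled.
payload = exclusive_detection([], "Bob", "NAME")
```

The resulting dictionary can be passed as the entity_detection argument of either function in this tutorial.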

And that's a wrap! In this tutorial, you used the Cape API to apply custom redactions to sensitive text and files. By combining BLOCK and ALLOW filters with suitable patterns, you can control what is redacted and what is retained. Head over to the API Docs for more information on our privacy endpoints.