
Integrating with OpenAI

This tutorial shows you how to send data to the OpenAI Chat Completions API with PII removed. See the previous tutorial for an introduction to the Cape API.

Get an API Key

This tutorial assumes you have an environment variable CAPE_API_KEY that contains an API key. See the previous tutorial if you have not set this up yet.
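
If you want to double-check that the variable is visible to your Python process, a quick sanity check (not part of the Cape API) looks like this:

import os

# fail early with a clear message if the Cape API key is missing
assert os.getenv("CAPE_API_KEY"), "CAPE_API_KEY is not set"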

Get an OpenAI Key

You will also need a key to access the OpenAI API. See the OpenAI documentation for more details.
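
The examples in this tutorial use the openai Python package, which reads the OPENAI_API_KEY environment variable automatically. If you prefer, you can also set the key explicitly:

import os
import openai

# equivalent to relying on the OPENAI_API_KEY environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")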

De-identify data with Cape

To start, we will send a simple piece of de-identified text to OpenAI.

First, let's define a deidentify function that de-identifies input strings.

tip

When working with de-identified information, we find it is often useful to mention in your prompt that you are doing so.
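
For example, a system prompt along these lines makes it explicit to the model that the text is redacted (the wording below is only an illustration, not taken from the Cape docs):

# hypothetical system prompt for use with redacted documents
system_prompt = (
    "You are a helpful assistant. The document you are given has been "
    "de-identified: names, occupations, and organizations appear as "
    "placeholders like [NAME_GIVEN_1]. Use those placeholders in your answer."
)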

import os
import requests


# call the Cape de-identify endpoint and return the redacted text plus the
# entities that were found
def deidentify(content, entities=[]):
    resp = requests.post(
        "https://api.capeprivacy.com/v1/privacy/deidentify/text",
        headers={"Authorization": f"Bearer {os.getenv('CAPE_API_KEY')}"},
        json={"content": content, "entities": entities}
    )

    return resp.json().get("content"), resp.json().get("entities")


text, entities = deidentify("Bob is a software engineer who works at Cape Privacy!")
print(text, entities)
Output
# text
"[NAME_GIVEN_1] is a [OCCUPATION_1] who works at [ORGANIZATION_1]!"

# entities
[
  {
    "processed_text": "NAME_GIVEN_1",
    "text": "Bob",
    "best_label": "NAME_GIVEN"
  },
  {
    "processed_text": "OCCUPATION_1",
    "text": "software engineer",
    "best_label": "OCCUPATION"
  },
  {
    "processed_text": "ORGANIZATION_1",
    "text": "Cape Privacy",
    "best_label": "ORGANIZATION"
  }
]

When we de-identify our text, Cape returns both the de-identified text and an entities dictionary (we will use the latter to re-identify what comes back from OpenAI). The entities dictionary is managed by the API caller for maximum flexibility: Cape does not store the redacted values anywhere, so when we re-identify later we have to pass the entities dictionary back in.
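
Because Cape does not keep this mapping, it is up to the caller to hold on to the entities between calls. A minimal sketch that persists them to a local JSON file (the file name is just an illustration):

import json

# save the entity mapping returned by deidentify so a later session can re-identify
with open("entities.json", "w") as f:
    json.dump(entities, f)

# ...later, load it back before calling reidentify
with open("entities.json") as f:
    entities = json.load(f)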

If we look at the content field of the response, we see "[NAME_GIVEN_1] is a [OCCUPATION_1] who works at [ORGANIZATION_1]!". The sensitive values have been replaced with placeholders, so we are no longer working directly with sensitive data.

tip

In the next section we make a second call to deidentify and pass in the entities dictionary we got from the first call. This causes Cape to redact entities deterministically, and it keeps every entity you have created so far in a single dictionary, which makes re-identification easier. If we didn't do this, the second deidentify call would not return any entity information about the organization (Cape Privacy). Since the chat completion response does contain a reference to the de-identified organization ([ORGANIZATION_1]), we need to make sure the entities we pass to reidentify include that reference.
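
To see the difference, try de-identifying the question on its own, without passing in the earlier entities. The exact placeholders may vary, but the returned entities will only describe what appears in the question itself, so there is nothing to map [ORGANIZATION_1] back to:

# de-identify the question without re-using the document's entities
q_text, q_entities = deidentify("Where does Bob work?")
print(q_text)       # expect something like "Where does [NAME_GIVEN_1] work?"
print(q_entities)   # only a NAME_GIVEN entry for "Bob"; no ORGANIZATION entry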

Send de-identified data to OpenAI

We can then use this function to de-identify data before we send it to OpenAI.

import os
import requests
import openai

# the openai package reads the OPENAI_API_KEY environment variable automatically;
# you can also set openai.api_key explicitly


def deidentify(content, entities=[]):
    resp = requests.post(
        "https://api.capeprivacy.com/v1/privacy/deidentify/text",
        headers={"Authorization": f"Bearer {os.getenv('CAPE_API_KEY')}"},
        json={"content": content, "entities": entities}
    )

    return resp.json().get("content"), resp.json().get("entities")


# de-identify the document, then the question, re-using the same entities dictionary
document, entities = deidentify("Bob is a software engineer who works at Cape Privacy!")
question, entities = deidentify("Where does Bob work?", entities=entities)

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. You are to be given a redacted document and you will answer questions about it. Use the redacted placeholders in your answer, don't say that you do not know"},
        {"role": "system", "content": document},
        {"role": "user", "content": question},
    ]
)
print(resp)
Output
{
  "id": "chatcmpl-7TaAMex6LD7rhwGob6C0dz1b2GiV8",
  "object": "chat.completion",
  "created": 1687284890,
  "model": "gpt-3.5-turbo-0301",
  "usage": {
    "prompt_tokens": 96,
    "completion_tokens": 16,
    "total_tokens": 112
  },
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "[NAME_GIVEN_1] works at [ORGANIZATION_1]."
      },
      "finish_reason": "stop",
      "index": 0
    }
  ]
}

Revealing our data with Cape

Finally, we define a reidentify function and use it to put the original values back into the answer we get from OpenAI.

import os
import requests
import openai


def reidentify(content, entities):
    resp = requests.post(
        "https://api.capeprivacy.com/v1/privacy/reidentify/text",
        headers={"Authorization": f"Bearer {os.getenv('CAPE_API_KEY')}"},
        json={"content": content, "entities": entities}
    )

    return resp.json().get("content")


def deidentify(content, entities=[]):
    resp = requests.post(
        "https://api.capeprivacy.com/v1/privacy/deidentify/text",
        headers={"Authorization": f"Bearer {os.getenv('CAPE_API_KEY')}"},
        json={"content": content, "entities": entities}
    )

    return resp.json().get("content"), resp.json().get("entities")


document, entities = deidentify("Bob is a software engineer who works at Cape Privacy!")
question, entities = deidentify("Where does Bob work?", entities=entities)

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. You are to be given a redacted document and you will answer questions about it. Use the redacted placeholders in your answer, don't say that you do not know"},
        {"role": "system", "content": document},
        {"role": "user", "content": question},
    ]
)

# re-identify the model's answer using the entities dictionary built above
reidentified = reidentify(resp["choices"][0]["message"]["content"], entities)
print(reidentified)
Output
"Bob works at Cape Privacy"

When we re-identify text, we take the response from OpenAI together with the entities from above and send them to https://api.capeprivacy.com/v1/privacy/reidentify/text. This substitutes the de-identified placeholders in the message content with the real values from the entities dictionary.

tip

You might have realized that clients have enough information to do re-identification locally, and that is true! You can iterate through the entities dictionary and replace any occurrences of the processed_text value in the OpenAI response with the text value.
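
A minimal sketch of that local approach, assuming the placeholders appear wrapped in square brackets as in the examples above (the function name is just illustrative):

def reidentify_locally(content, entities):
    # replace each bracketed placeholder with the original value it stands for
    for entity in entities:
        content = content.replace(f"[{entity['processed_text']}]", entity["text"])
    return content


print(reidentify_locally("[NAME_GIVEN_1] works at [ORGANIZATION_1].", entities))
# expect: "Bob works at Cape Privacy."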