Integrating with OpenAI
This tutorial shows you how to send data to the OpenAI chat completions API with PII removed. See the previous tutorial for an introduction to the Cape API.
Get an API Key
This tutorial assumes you have an environment variable CAPE_API_KEY that contains an API key. See the previous tutorial if you have not set this up.
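If you want to confirm the key is visible from Python before continuing, a quick check looks like this:
import os

# Fail fast if the Cape API key from the previous tutorial is not available.
assert os.getenv("CAPE_API_KEY"), "CAPE_API_KEY is not set"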
Get an OpenAI Key
You will also need a key to access the OpenAI API. See the OpenAI documentation for more details.
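The examples below use the openai Python package (the pre-1.0 ChatCompletion interface), which reads the key from the OPENAI_API_KEY environment variable by default. This snippet assumes your key is stored in that environment variable; you can also set it explicitly:
import os
import openai

# openai 0.x reads OPENAI_API_KEY from the environment automatically;
# setting it explicitly also works.
openai.api_key = os.getenv("OPENAI_API_KEY")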
De-identify data with Cape
To start, we will send a simple piece of de-identified text to OpenAI. First, let's define a deidentify function that can de-identify input strings. When working with de-identified information, we find it is often useful to mention in your prompt that the text has been de-identified.
import os
import requests

def deidentify(content, entities=[]):
    # Send the text to Cape's de-identify endpoint; the response contains the
    # redacted text and the entity mappings used to redact it.
    resp = requests.post(
        "https://api.capeprivacy.com/v1/privacy/deidentify/text",
        headers={"Authorization": f"Bearer {os.getenv('CAPE_API_KEY')}"},
        json={"content": content, "entities": entities},
    )
    return resp.json().get("content"), resp.json().get("entities")
text, entities = deidentify("Bob is a software engineer who works at Cape Privacy!")
print(text, entities)
Output
# text
"[NAME_GIVEN_1] is a [OCCUPATION_1] who works at [ORGANIZATION_1]!"

# entities
[
    {
        "processed_text": "NAME_GIVEN_1",
        "text": "Bob",
        "best_label": "NAME_GIVEN",
    },
    {
        "processed_text": "OCCUPATION_1",
        "text": "software engineer",
        "best_label": "OCCUPATION",
    },
    {
        "processed_text": "ORGANIZATION_1",
        "text": "Cape Privacy",
        "best_label": "ORGANIZATION",
    }
]
When we de-identify our text, the de-identified text is returned along with an entities list (we will use this later to re-identify what comes back from OpenAI). The entities list is managed by the API caller for maximum flexibility: the redacted information isn't stored anywhere by Cape, so we will have to pass the entities list back in when we re-identify.
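One simple way to hold on to the entities between the de-identify and re-identify steps is to persist them yourself. Here is a minimal sketch that writes them to a JSON file (the filename is just an example):
import json

# Persist the entities returned by deidentify; this is the only record of the
# redacted values, so keep it somewhere the client controls.
with open("entities.json", "w") as f:
    json.dump(entities, f)

# ...later, load them back before calling reidentify.
with open("entities.json") as f:
    entities = json.load(f)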
If we look at the content field of the chat completion response we get back below, we will see "[NAME_GIVEN_1] works at [ORGANIZATION_1]." We are never sending the sensitive data to OpenAI!
When we make our second call to deidentify (in the example below), we pass the entities list that we got from the first call. This causes Cape to redact entities in a deterministic way, and it keeps every entity created so far in the list, which makes re-identifying easier. If we didn't do this, the second deidentify call would not return any entity information about the organization (Cape Privacy), because the question never mentions it. Since the chat completion response does contain a reference to the de-identified organization ([ORGANIZATION_1]), we need to make sure the entities we pass to reidentify include that mapping.
Send de-identified data to OpenAI
We can then use this function to de-identify data before we send it to OpenAI.
import os
import requests
import openai

def deidentify(content, entities=[]):
    resp = requests.post(
        "https://api.capeprivacy.com/v1/privacy/deidentify/text",
        headers={"Authorization": f"Bearer {os.getenv('CAPE_API_KEY')}"},
        json={"content": content, "entities": entities},
    )
    return resp.json().get("content"), resp.json().get("entities")

document, entities = deidentify("Bob is a software engineer who works at Cape Privacy!")
question, entities = deidentify("Where does Bob work?", entities=entities)

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. You are to be given a redacted document and you will answer questions about it. Use the redacted placeholders in your answer, don't say that you do not know"},
        {"role": "system", "content": document},
        {"role": "user", "content": question},
    ]
)

print(resp)
Output
{
    "id": "chatcmpl-7TaAMex6LD7rhwGob6C0dz1b2GiV8",
    "object": "chat.completion",
    "created": 1687284890,
    "model": "gpt-3.5-turbo-0301",
    "usage": {
        "prompt_tokens": 96,
        "completion_tokens": 16,
        "total_tokens": 112
    },
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "[NAME_GIVEN_1] works at [ORGANIZATION_1]."
            },
            "finish_reason": "stop",
            "index": 0
        }
    ]
}
Revealing our data with Cape
import os
import requests
import openai

def reidentify(content, entities):
    # Send the de-identified text back to Cape along with the entities so the
    # placeholders can be swapped back to the original values.
    resp = requests.post(
        "https://api.capeprivacy.com/v1/privacy/reidentify/text",
        headers={"Authorization": f"Bearer {os.getenv('CAPE_API_KEY')}"},
        json={"content": content, "entities": entities},
    )
    return resp.json().get("content")

def deidentify(content, entities=[]):
    resp = requests.post(
        "https://api.capeprivacy.com/v1/privacy/deidentify/text",
        headers={"Authorization": f"Bearer {os.getenv('CAPE_API_KEY')}"},
        json={"content": content, "entities": entities},
    )
    return resp.json().get("content"), resp.json().get("entities")

document, entities = deidentify("Bob is a software engineer who works at Cape Privacy!")
question, entities = deidentify("Where does Bob work?", entities=entities)

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. You are to be given a redacted document and you will answer questions about it. Use the redacted placeholders in your answer, don't say that you do not know"},
        {"role": "system", "content": document},
        {"role": "user", "content": question},
    ]
)

reidentified = reidentify(resp["choices"][0]["message"]["content"], entities)
print(reidentified)
Output
"Bob works at Cape Privacy"
When we re-identify text, we take the response from OpenAI, along with the entities from above, and send them to https://api.capeprivacy.com/v1/privacy/reidentify/text. This substitutes the de-identified placeholders in the message content with the real values from the entities list.
You might have realized that clients have enough information to do re-identification locally, and that is true! You can iterate through the entities list and replace any occurrences of the processed_text value in the OpenAI response with the text value.
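As a sketch, local re-identification could look like the following. It assumes the entities format shown earlier, where the processed_text values do not include the surrounding brackets, so the helper adds them before replacing:
def reidentify_locally(content, entities):
    # Swap each placeholder back to its original value using the mappings
    # returned by deidentify.
    for entity in entities:
        placeholder = entity["processed_text"]
        if not (placeholder.startswith("[") and placeholder.endswith("]")):
            placeholder = f"[{placeholder}]"
        content = content.replace(placeholder, entity["text"])
    return content

print(reidentify_locally(resp["choices"][0]["message"]["content"], entities))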