The furor surrounding ChatGPT remains at a fever pitch as the ins and outs of the AI chatbot’s potential continue to make headlines. One issue that has caught the attention of many in the security field is whether the technology’s ingestion of sensitive business data puts organizations at risk. There is some fear that if someone inputs sensitive information (quarterly reports, materials for an internal presentation, sales numbers, or the like) and asks ChatGPT to write text around it, anyone could later gain information on that company simply by asking ChatGPT about it.
The implications of that could be far-reaching: Imagine working on an internal presentation that contained new data revealing a corporate problem to be discussed at a board meeting. Letting that proprietary information out into the wild could undermine stock price, consumer attitudes, and client confidence. Even worse, a legal item on the agenda being leaked could expose a company to real liability. But could any of these things actually happen simply because someone typed the material into a chatbot?
This concept was explored by research firm Cyberhaven in February, concentrating on how OpenAI uses the content people put into ChatGPT as training data to improve its technology, with output closely resembling what was input. Cyberhaven claimed that confidential data input into ChatGPT could potentially be revealed to a third party who asked ChatGPT certain questions based on the information the original user provided.
ChatGPT doesn’t store users’ input data — does it?
The UK’s National Cyber Security Centre (NCSC) shared further insight on the matter in March, stating that ChatGPT and other large language models (LLMs) do not currently add information automatically from queries to models for others to query. That is, including information in a query will not result in that potentially private data being incorporated into the LLM. “However, the query will be visible to the organization providing the LLM (so in the case of ChatGPT, to OpenAI),” it wrote.
“Those queries are stored and will almost certainly be used for developing the LLM service or model at some point. This could mean that the LLM provider (or its partners/contractors) are able to read queries and may incorporate them in some way into future versions,” it added. Another risk, which increases as more organizations produce and use LLMs, is that queries stored online may be hacked, leaked, or accidentally made publicly accessible, the NCSC wrote.
Ultimately, there is genuine cause for concern regarding sensitive business data being inputted into and used by ChatGPT, although the risks are likely less pervasive than some headlines make out.
Likely risks of inputting sensitive data to ChatGPT
LLMs exhibit an emergent behavior called in-context learning. During a session, as the model receives inputs, it can become conditioned to perform tasks based upon the context contained within those inputs. “This is likely the phenomenon people are referring to when they worry about information leakage. However, it is not possible for information from one user’s session to leak to another’s,” Andy Patel, senior researcher at WithSecure, tells CSO. “Another concern is that prompts entered into the ChatGPT interface will be collected and used in future training data.”
Although it’s valid to be concerned that chatbots will ingest and then regurgitate sensitive information, a new model would need to be trained in order to incorporate that data, Patel says. Training LLMs is an expensive and lengthy procedure, and he says he would be surprised if a model were trained on data collected by ChatGPT in the near future. “If a new model is eventually created that includes collected ChatGPT prompts, our fears turn to membership inference attacks. Such attacks have the potential to expose credit card numbers or personal information that were in the training data. However, no membership inference attacks have been demonstrated against the LLMs powering ChatGPT and other similar systems.” That means it’s extremely unlikely that future models would be susceptible to membership inference attacks, though Patel admits it’s possible that the database containing saved prompts could be hacked or leaked.
Third-party linkages to AI could expose data
Issues are most likely to arise with external providers that do not explicitly state their privacy policies. Using such providers alongside otherwise secure tools and platforms can put data that would otherwise stay private at risk, says Wicus Ross, senior security researcher at Orange Cyberdefense. “SaaS platforms such as Slack and Microsoft Teams have clear data and processing boundaries and a low risk of data being exposed to third parties. However, these clear lines can quickly become blurred if the services are augmented with third-party add-ons or bots that need to interact with users, irrespective of whether they are linked to AI,” he says. “In the absence of a clear explicit statement where the third-party processor guarantees that the information will not leak, you must assume it is no longer private.”
Aside from sensitive data being shared by regular users, companies should also be aware of prompt injection attacks, which could reveal the instructions developers provided when tuning the tool or make it ignore previously programmed directives, Neil Thacker, Netskope’s CISO for EMEA, tells CSO. “Recent examples include Twitter pranksters changing the bot’s behavior and issues with Bing Chat, where researchers found a way to make ChatGPT disclose previous instructions likely written by Microsoft that should be hidden.”
Control what data is submitted to ChatGPT
Sensitive data currently makes up 11% of what employees paste into ChatGPT, with the average company leaking sensitive data to ChatGPT hundreds of times each week, according to Cyberhaven. “ChatGPT is moving from hype into the real world and organizations are experimenting with practical implementation across their business to join their other ML/AI-based tools, but there needs to be some caution applied, especially when it comes to the sharing of confidential information,” Thacker says. “Consideration should be made about aspects of the data ownership and what the potential impact is if the organization hosting the data is breached. As a simple exercise, information security professionals should, at a minimum, be able to identify the category of data that is potentially accessible in the event of a breach of these services.”
Ultimately, it is a business’s responsibility to ensure its users are fully aware of what information should and shouldn’t be disclosed to ChatGPT. Organizations should take great care with the data they choose to submit in prompts, the NCSC says: “You should ensure that those who want to experiment with LLMs are able to, but in a way that doesn’t place organizational data at risk.”
Warn workers about the potential danger of chatbots
However, identifying and controlling the data employees submit to ChatGPT isn’t without challenge, Cyberhaven warned. “When workers input company data into ChatGPT, they don’t upload a file but rather copy and paste content into their web browser. Many security products are designed around protecting files (which are tagged confidential) from being uploaded but once content is copied out of the file they are unable to keep track of it,” it wrote. What’s more, company data going to ChatGPT often doesn’t contain a recognizable pattern that security tools look for, such as a credit card number or Social Security number, Cyberhaven said. “Without knowing more about its context, security tools today can’t tell the difference between someone inputting the cafeteria menu and the company’s M&A plans.”
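To make Cyberhaven’s point concrete, here is a minimal Python sketch of the kind of pattern matching it describes. It is an illustration only, not how any particular security product works: the two regular expressions and the function names are assumptions made for this example, and commercial tools use far larger rule sets.

```python
import re

# Hypothetical pattern set for illustration; real DLP rule libraries are much larger.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Double every second digit, counting from the rightmost digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0


def find_sensitive_patterns(text: str) -> list[str]:
    """Return the names of the recognizable patterns found in pasted text."""
    findings = []
    if SSN_RE.search(text):
        findings.append("ssn")
    # Only report card-like digit runs that also pass the checksum.
    if any(luhn_valid(m.group()) for m in CARD_RE.finditer(text)):
        findings.append("credit_card")
    return findings
```

A check like this flags a pasted test card number, but it returns an empty list for both a cafeteria menu and a paragraph describing acquisition plans, because neither contains a pattern it knows to look for.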
For improved visibility, organizations should implement policies on their secure web gateways (SWGs) to identify the use of AI tools, and they can also apply data loss prevention (DLP) policies to identify what data is being submitted to these tools, Thacker says.
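As a rough illustration of the first half of that advice, the Python sketch below counts requests to AI tool domains per user from web gateway logs. Everything about it is an assumption made for this example: the log shape (a user name paired with a URL), the function names, and the one-entry watchlist. Actual SWG products expose their own policy engines and log formats.

```python
from collections import Counter
from urllib.parse import urlsplit

# Illustrative watchlist only; an organization would maintain its own list.
AI_TOOL_DOMAINS = {"openai.com"}


def is_ai_tool_host(host: str) -> bool:
    """Return True if host is a watched domain or one of its subdomains."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in AI_TOOL_DOMAINS)


def count_ai_tool_requests(log_entries):
    """Count watched requests per user.

    log_entries is an iterable of (user, url) tuples, a hypothetical log shape.
    """
    counts = Counter()
    for user, url in log_entries:
        host = urlsplit(url).hostname or ""
        if is_ai_tool_host(host):
            counts[user] += 1
    return counts
```

Matching on the domain suffix rather than a substring avoids flagging unrelated sites whose names merely contain a watched domain. A report like this shows who is using the tools, not what they are submitting, which is where the DLP policies come in.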
Organizations should update information protection policies to ensure that the types of applications that are acceptable handlers of confidential data are well documented, says Michael Covington, vice president of portfolio strategy at Jamf. “Controlling that flow of information begins with a well-documented and informed policy,” he says. “Additionally, organizations should be exploring how they can utilize these new technologies to improve their businesses in a thoughtful way. Don’t shy away from these services out of fear and uncertainty but dedicate some staffing to explore new tools that show potential so you can understand the risks early and ensure adequate protections are in place when early end-user adopters want to start using the tools.”