
Using personal data in prompts? A proposed solution

The data protection-compliant use of generative AI models at universities is currently a hotly debated topic. While the use of open source models does not seem feasible in the short term due to a lack of existing infrastructure, solutions are already being discussed that avoid transferring the personal data of university members to commercial providers. But what happens to personal data that is entered into prompts and thus passed on to commercial providers? Below, I would like to outline a proposed solution to this problem.

Two main solutions are currently being discussed in the higher education sector for providing GDPR-compliant access to large language models:

  1. The development of a digitally sovereign infrastructure based on an open source LLM: While smaller models such as Vicuna (7B) or Llama 2 (7B) can even be run locally on reasonably up-to-date, high-performance hardware, this is not currently the case for larger models such as Smaug (72 billion parameters). At the same time, Smaug is an open model that even achieves better benchmark results than GPT-3.5 or Gemini.
  2. Access to commercial language models via APIs, such as those provided by OpenAI: Because university members do not access the models directly via OpenAI but through an interface provided by the university or a university association, no user data is transferred to the (usually US-based) LLM providers.

Open source – the ideal solution

While a digitally sovereign open source solution is certainly the ideal way to solve the data protection problem, it comes with a few hurdles in the short term: It first needs to be investigated whether the hardware requirements of an infrastructure powerful enough to handle simultaneous requests from thousands of students can realistically be met. A commitment from state and federal politicians is needed to provide the necessary resources. And such an infrastructure would not only have to be set up but also kept up to date, which, given the pace of development in generative AI, is no trivial task.

However, a dedicated infrastructure would offer the enormous advantage of knowing exactly where one’s own data is being processed. This applies not only to the personal data of the university’s users, but also to any data handed to the language model via prompts, which may itself be personal data.

While such a solution is desirable and its feasibility should be examined in any case, it is unlikely to be feasible in the short term. For this reason, many universities are currently considering the second option mentioned above.

Pragmatic (bridge) solution: use of commercial APIs

APIs (Application Programming Interfaces) are interfaces that software providers make available so that other software can interact with their systems. As a programmer, I can use the APIs provided by OpenAI, for example, to integrate the functionality of the text and image generators into my own programs. This allows universities to build their own user interfaces, which in turn “talk” to the APIs of commercial providers. The advantage: university members only interact with the university interface. The commercial provider does not gain access to the user data and cannot attribute requests to individual users.
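To make this concrete, here is a minimal sketch of such a forwarding layer in Python, using OpenAI’s official client library. The function name forward_prompt and the placeholder key are my own illustration, not part of any existing university interface:

```python
# pip install openai
from openai import OpenAI

# The university holds ONE organizational API key; individual users never see it.
client = OpenAI(api_key="sk-...")  # placeholder key

def forward_prompt(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Forward a prompt to the commercial API without any user identifier.

    The request contains only the prompt text and the university's key,
    so the provider cannot link it to an individual university member.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```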

Another advantage: since many providers make APIs available, a large number of competing AI models could be offered in parallel via the university’s own interface. University members could try out whether GPT-4 Turbo, Gemini 1.5 or Claude Opus is best suited to their tasks. At the same time, a certain degree of independence from any single provider would be achieved.

Most providers of AI models charge for the use of their APIs based on volume, usually per million tokens (one token usually corresponds to slightly less than one word). It is noticeable that the latest models are often many times more expensive than their predecessors. For example, OpenAI currently charges USD 10 / USD 30 (input/output) per 1 million tokens for GPT-4 Turbo, compared to USD 0.50 / USD 1.50 for GPT-3.5 Turbo.
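A rough back-of-the-envelope comparison shows what this pricing means per request; the token counts below are made up purely for illustration:

```python
# Cost comparison for a single request with 1,000 input and 500 output tokens,
# using the per-million-token prices quoted above.
GPT4_TURBO = {"input": 10.00, "output": 30.00}    # USD per 1M tokens
GPT35_TURBO = {"input": 0.50, "output": 1.50}     # USD per 1M tokens

def cost(prices: dict, input_tokens: int, output_tokens: int) -> float:
    """Return the price of one request in USD."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

print(cost(GPT4_TURBO, 1000, 500))   # roughly USD 0.025
print(cost(GPT35_TURBO, 1000, 500))  # roughly USD 0.00125, about 20x cheaper
```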

For cost control, user management with a credit system could be provided via the university interface – ideally linked to the respective university’s single sign-on. Each university member could then decide for themselves whether credit-intensive GPT-4 access is necessary for the task at hand or whether GPT-3.5 already delivers sufficiently good results. The overall costs would thus be capped. It would also be possible for users to purchase additional credit as required.
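How such a credit system might work is sketched below; the user ID, credit prices and in-memory “database” are hypothetical and purely for illustration:

```python
# Hypothetical per-user credit accounting behind the university's single sign-on.
# Prices are "credits per 1,000 tokens" and purely illustrative.
CREDIT_COST = {"gpt-4-turbo": 10, "gpt-3.5-turbo": 1}

balances = {"sso-user-4711": 500}  # would live in a real database in practice

def charge(user_id: str, model: str, tokens_used: int) -> bool:
    """Deduct credits for a request; refuse it if the balance is insufficient."""
    cost = CREDIT_COST[model] * tokens_used // 1000
    if balances.get(user_id, 0) < cost:
        return False  # user must top up or switch to a cheaper model
    balances[user_id] -= cost
    return True
```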

All very well – but how to deal with personal data in prompts?

While the user data is protected from access by commercial providers in the way outlined above, this does not apply to any personal data entered into prompts. This data still reaches the provider’s servers via the API, where it is processed further. Even terms of use that prohibit entering personal data into prompts probably cannot fully solve this problem: such a ban is not only difficult to enforce, it would also rule out some perfectly sensible use cases for the language models.

One solution could be automated local pre-processing of the prompts before they are passed on to the APIs. If a local system automatically recognizes and anonymizes/pseudonymizes personal data in the prompts, the problem of personal data in prompts would be solved or at least significantly reduced. Pseudonymization would also have the advantage that the local system could create a temporary pseudonym glossary, which it could use to translate back the (also pseudonymized) response from the LLM. Ideally, the end user would not notice this process at all; they could therefore use commercial LLMs with personal data with a clear conscience without the data ending up with the LLM provider.
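As a plain-Python illustration of the glossary idea (the detection of personal data itself is left out here and would be supplied by a recognition tool such as the one discussed in the next section; all function names are my own):

```python
# Illustrative sketch of the pseudonym glossary: replace detected names with
# placeholders before sending the prompt, and translate the LLM's answer back.

def pseudonymize(prompt: str, detected_names: list[str]) -> tuple[str, dict]:
    """Replace detected personal names with placeholders and return the glossary."""
    glossary = {}
    for i, name in enumerate(detected_names, start=1):
        placeholder = f"<PERSON_{i}>"
        glossary[placeholder] = name
        prompt = prompt.replace(name, placeholder)
    return prompt, glossary

def depseudonymize(llm_response: str, glossary: dict) -> str:
    """Translate the placeholders in the LLM's answer back to the real names."""
    for placeholder, name in glossary.items():
        llm_response = llm_response.replace(placeholder, name)
    return llm_response

# The provider only ever sees "<PERSON_1>", never the real name.
safe_prompt, glossary = pseudonymize(
    "Write a reference letter for Erika Mustermann.", ["Erika Mustermann"]
)
```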

And how to implement this?

A lightweight language model could certainly be fine-tuned to automatically recognize personal data and create a pseudonym glossary. But why do the work yourself when solutions already exist?

Microsoft has published Presidio, an open source tool under an MIT license that automatically anonymizes data and can be adapted for the purpose at hand. Providing a pseudonym glossary as outlined above should also be possible with Presidio without too much effort.
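A minimal example of what anonymization with Presidio looks like (the example text is mine; Presidio’s default configuration additionally requires a spaCy language model to be installed):

```python
# pip install presidio-analyzer presidio-anonymizer
# python -m spacy download en_core_web_lg   (Presidio's default NLP model)
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

text = "Jane Doe (jane.doe@example.edu) missed the exam on 12 March 2024."

# Detect personal data (names, e-mail addresses, dates, ...) in the text.
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

# Replace every detected entity with a generic placeholder.
anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
)
print(anonymized.text)
```

Instead of the generic replacement operator, the detected entities could be mapped to numbered pseudonyms and stored in a glossary, as sketched in the previous section.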

If you would like to try out the tool, you can do so with the demo provided. My tests showed that Presidio recognizes personal data quite reliably, although one or two items of personal data remained undetected in the text (false negatives). Fine-tuning the settings will certainly produce even better results.
