Website operators who do not want their content used to train OpenAI’s GPT language models can now prevent this. The US company behind ChatGPT has explained how its web crawler can be blocked via a website’s robots.txt file. All it takes is adding these two lines to the file:
User-agent: GPTBot
Disallow: /
Operators can also specify that only certain directories may be used for AI training and others not. OpenAI provides the following code for this:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
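Whether such rules behave as intended can be checked programmatically before relying on them. Here is a minimal sketch in Python using the standard urllib.robotparser module; the example.com domain is a placeholder and the directory names are taken from the snippet above:

from urllib.robotparser import RobotFileParser

# Load the site's live robots.txt (example.com is a placeholder).
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check which paths the GPTBot user agent may fetch,
# assuming robots.txt contains exactly the rules shown above.
print(rp.can_fetch("GPTBot", "https://example.com/directory-1/page"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/directory-2/page"))  # False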
The OpenAI crawler identifies itself with the following strings:
User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
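This also makes it possible to spot the crawler’s visits in existing server logs. A minimal sketch, assuming a common access-log setup in which the user-agent string appears somewhere on each line (the file name access.log is an assumption; adjust it to your server’s configuration):

# Print every access-log line recorded for OpenAI's crawler.
# "access.log" is an assumed path; the match relies on the
# GPTBot token appearing in the logged user-agent string.
with open("access.log") as log:
    for line in log:
        if "GPTBot" in line:
            print(line, end="")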
Help improve GPT
Websites visited by the GPTBot “can potentially be used to improve future GPT models,” writes OpenAI. At the same time, the company explains that it will filter out sites that have a paywall, are known to collect personally identifiable information, or contain text “that violates our rules.” According to OpenAI, anyone who allows the crawler to collect data can help make AI models more accurate and more capable overall.
Just a few days ago, OpenAI and other AI companies committed to the US government to watermark AI-generated content in the future and to test the technology thoroughly before bringing it to market. That commitment said nothing about disclosing which content from the internet the models were trained on, nor about ending the practice of simply scraping the internet for that purpose. With the robots.txt instructions, website operators now regain at least some control.
OpenAI’s approach is not the first attempt to let content creators decide for themselves whether they want to contribute to AI training. As early as last November, the online art portal DeviantArt explained to its users how to attach a corresponding flag to their works, introducing a “noai” label for this purpose. However, DeviantArt could not control whether AI developers would actually respect it.
At the same time, the platform initially made opting out of its own image generator DreamUp, introduced alongside the label, considerably more difficult, and changed course only after fierce protests. Training then became opt-in: only those who wanted their works used as AI training material had to take action. The opposite applies to OpenAI’s GPTBot: anyone who wants to block its path to a website must now adapt the robots.txt themselves.
(my)