is your website training AI: typos

Wouter Groeneveld 2023-04-23 18:19:00 +02:00
parent 4c17bc1a06
commit 0e67f8ebc4
1 changed files with 5 additions and 3 deletions

@@ -30,6 +30,8 @@ I have also blocked the following user-agents: `ChatGPT-User`, `Mediapartners-Go
Of course, it would be very naive of me to think the problem is solved now. First, the damage is already done and there seems to be no way to remove your site from an existing data set (why?). Second, who says crawlers will play ball and obey your `robots.txt` file? Third, should we be blocking CCBot in the first place? Common Crawl states "Our goal is to democratize the data so everyone, not just big companies, can do high quality research and analysis." Then again, a data set provided for "everyone" can also easily be abused by "everyone", including big tech. So I don't know. I'm interested to hear the opinion of an expert on this.
As an aside, it's pretty clear that building a long `robots.txt` file is a silly way to try and fend off bots---who says `ChatGPT-User` won't be `ChatGPT-PowerUser` next month? And what about `MyCoolBot` releasing soon? There goes your `User-agent` string key.
It is clear that crawlers keep track of the source URLs; otherwise I wouldn't be able to find my site entries in the above C4 search link. So why not at least provide these source links as citations to your users? That's called common courtesy. In academia, blatantly copying the text of others without providing accurate references will get you into trouble. Of course, most of big tech's not-so-secret rules revolve around stealing and paywalling your content, so why should large language models be any different?
---
@@ -38,10 +40,10 @@ And then we haven't talked about the ignoring of licenses yet. Last time this ha
Ever since the launch of this site, I've been an avid follower of Leo Babauta's [uncopyright mindset](https://mnmlist.com/uncopyright-and-a-minimalist-mindset/). Under `/no-copyright-no-tracking`, I wrote:
> I've always detested the this is mine!-mindset, especially when it comes to intellectual property. Everyone benefits if everything is open and everyone can build upon each other's work. A possible financial loss is not an excuse. Leo has found copyrights not to be particularly helpful, so he simply got rid of them. He sells thousands of ebooks monthly. You have the right to share them with friends. He would rather have you buy them, but this way his work reaches a broader audience.
> I've always detested the this is mine!-mindset, especially when it comes to intellectual property. Everyone benefits if everything is open and everyone can build upon each other's work. A possible financial loss is not an excuse. Leo has found copyrights not to be particularly helpful, so he simply got rid of them. He sells thousands of e-books monthly. You have the right to share them with friends. He would rather have you buy them, but this way his work reaches a broader audience.
In light of the recent "advancement" in the field of commercial AI, I'm afraid I have to change that. I hate resorting to a confusing Creative Commons license, but MIT is specifically geared towards software instead of writing, and the absolute least I want to enforce is **attribution**. So henceforth, _Brain Baking_ is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
Which, of course, changes little. Microsoft happily ignored any `LICENSE` files when gobbling up repositories for GitHub Copilot, and web scrapers/bots OpenAI and the like utilize are sure to do the same. Still, at least on my website it states that while you can do whatever you want with what's written here, you _should_ have the courtesy to correctly attribute the source.
Which, of course, changes little. Microsoft happily ignored any `LICENSE` files when gobbling up repositories for GitHub Copilot, and the web scrapers/bots of OpenAI and the like are sure to do the same. Still, at least on my website it states that while you can do whatever you want with what's written here, you _should_ have the courtesy to correctly attribute the source.
I still like and believe in the _Sharing Is Caring_ mantra, but please don't mistake it for _Stealing Is Moneymaking_.
I still like and believe in the _Sharing Is Caring_ mantra, but please don't mistake it for _Stealing Is Moneymaking_. I wonder when this bigger data ethics issue will be thoroughly addressed at the governmental (and global) level. It's pretty clear to me that without the government directing how society should handle this, greedy gobblers will keep on gobbling.