AI Safety and Fair Use
Issues of AI safety and fair use seem like novel problems.
The New York Times recently sued OpenAI for training ChatGPT on NYT articles without licensing them — or in other words, stealing their data. Authors, musicians, and artists are piling onto class action suits against OpenAI with similar allegations.
From a supply chain perspective, data is the primary input into AI applications. Without a large amount of high quality data (i.e. a truth set or ground truth), the machine learning algorithms underpinning these applications do not work.
AI and data companies have two approaches to collecting data:
1. They obtain the data through fair and transparent licensing agreements that benefit the original owner of the data.
2. They operate in the shadows, stealing data or obtaining it through exploitative contracts that do not fairly compensate the original data owner and expose them to safety risks.
Unfortunately, the second approach, covert and exploitative collection, is the status quo.
On the safety front, AI-powered tools have significantly lowered the barrier to entry for creating deepfakes, realistic-sounding spam, and more. An AI safety bill was recently proposed and vetoed in California:
SB 1047 would’ve mandated that companies developing powerful AI models take reasonable care to ensure that their technologies wouldn’t cause “severe harm” such as mass casualties or property damage above $500 million.
OpenAI opposed this legislation:
“The AI revolution is only just beginning, and California’s unique status as the global leader in AI is fueling the state’s economic dynamism,” Jason Kwon, chief strategy officer at OpenAI, wrote in a letter last month opposing the legislation. “SB 1047 would threaten that growth, slow the pace of innovation, and lead California’s world-class engineers and entrepreneurs to leave the state in search of greater opportunity elsewhere.”
The recent explosion in popularity of consumer AI software has shined a light on AI safety and fair use.
A murky and loosely regulated data supply chain lets technology companies and their customers fleece the original owners of the data. Those owners are also exposed to harm when their data is misused or weaponized against them.
These are not novel problems. They’ve been around for decades.
An AI company requires three main inputs to build its products: machine learning algorithms, data to feed its models, and compute (i.e. data centers) to train and deploy those models.
The machine learning algorithms are becoming commodified: Meta (Facebook) is releasing high-quality open-weight models (e.g. Llama) to try to neutralize the traction of startups like OpenAI and Anthropic.
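To make the commoditization point concrete, here is a minimal sketch, assuming the Hugging Face transformers library is installed and you have access to an open-weight Llama checkpoint (the specific model ID below is my illustrative choice, not something named in this piece). Running a capable open model locally now takes only a few lines of off-the-shelf code, which is why the algorithms themselves are no longer the moat.

```python
# Minimal sketch: loading and sampling from an open-weight model.
# Assumes the `transformers` library and an accepted license for the checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative; any open-weight checkpoint works

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The most strategic input into an AI product is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```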
Compute is also widely accessible, and its unit economics have generally trended down. Compute pricing is a race to the bottom.
That leaves data as the most strategic input.
A comprehensive regulatory solution to AI safety and fair use should protect people without killing innovation.
Progress in AI requires a lot of data. I have no interest in stalling progress in this field. But I believe that mortgaging people's civil rights, data ownership rights, and safety in the name of increased data collection is a shortcut not worth taking.
I think some sort of intermediary, like a data broker, is necessary to make this whole system work smoothly. But these brokers can’t be trusted to operate without oversight.