The work was supported by the Air Force Office of Scientific Research, the Office of Naval Research, and the US National Science Foundation. The framework models the complex mechanical behavior of spinodal microstructures by combining submicron 3D printing, in-situ electron microscopy testing, and deep learning. It accurately captures nonlinear, directional stress-strain responses with prediction errors as low as 5 to 10 percent.
AI ‘gold rush’ for chatbot training data could run out of human-written text as early as 2026
- Much has changed since then, including new techniques that enabled AI researchers to make better use of the data they already have and sometimes “overtrain” on the same sources multiple times.
- The team’s latest study is peer-reviewed and due to be presented at this summer’s International Conference on Machine Learning in Vienna, Austria.
- And this is because organizations are better understanding the importance of high-quality data to the success of AI initiatives.
- By leveraging advanced technologies like AI and machine learning, organizations can ensure that data flows seamlessly through the pipeline, enabling real-time analytics and faster decision-making.
- Training on AI-generated data is “like what happens when you photocopy a piece of paper and then you photocopy the photocopy. You lose some of the information,” Papernot said.
“It also gives students an opportunity to work on a project that they can further talk about during interviews.” Sports coaches are always looking for ways to improve their teams and put them in the best position to succeed. However, Sagiraju sees that the gap is slowly narrowing year over year when it comes to understanding the challenges of AI.
Artificial intelligence is being asked to predict the future of AI
“As the person who uses the data the teams were provided, it was an amazing experience to see unique solutions that the teams presented,” he said. “It inspired me to create new solutions to the program’s issues and I cannot wait to implement some of the projects into the program’s workflow.” “This accelerated timeline taught us critical lessons in rapid decision-making, collaborative teamwork, and efficient problem-solving,” he said. “It’s a unique opportunity to simulate real-world, high-pressure scenarios where delivering impactful solutions quickly is crucial.” “While many teams start off with manually labeling their datasets, more are turning to time-saving methods to partially automate the process,” Sagiraju said.
“Maybe you don’t lop off the tops of every mountain,” jokes Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation, which runs Wikipedia. “It’s an interesting problem right now that we’re having natural resource conversations about human-created data. I shouldn’t laugh about it, but I do find it kind of amazing.” From the perspective of AI developers, Epoch’s study says paying millions of humans to generate the text that AI models will need “is unlikely to be an economical way” to drive better technical performance.
The method offers a way to accelerate the development of lighter, stronger, and more energy-efficient materials, with potential applications in aerospace, defense, biomedical implants, and electronics. It reduces the need for costly and time-intensive trial-and-error testing, which has traditionally slowed innovation in materials science. The method enables faster, more cost-effective development of advanced materials with tailored properties, reducing reliance on time-consuming experiments and simulations.
As he collected data and watched students collaborate during the event, he realized how important it is to understand the problems clients face before identifying solutions. Training on AI-generated data is “like what happens when you photocopy a piece of paper and then you photocopy the photocopy. You lose some of the information,” Papernot said. Not only that, but Papernot’s research has also found it can further encode the mistakes, bias and unfairness that are already baked into the information ecosystem.
An example is self-driving car companies, which face regulatory, safety and legal challenges in obtaining data from real roads. Artificial intelligence systems like ChatGPT could soon run out of what keeps making them smarter: the tens of trillions of words people have written and shared online. AI companies should be “concerned about how human-generated content continues to exist and continues to be accessible,” Deckelmann said.
If real human-crafted sentences remain a critical AI data source, those who are stewards of the most sought-after troves — websites like Reddit and Wikipedia, as well as news and book publishers — have been forced to think hard about how they’re being used. Companies use artificially generated data to complement the data they collect from the real world. Synthetic data is especially useful in applications where obtaining real-world data is costly or dangerous.
With proper data governance, the pharma industry can improve patient-centricity in trials and bring lifesaving therapies to market quickly and safely. The team’s latest study is peer-reviewed and due to be presented at this summer’s International Conference on Machine Learning in Vienna, Austria. Epoch is a nonprofit institute hosted by San Francisco-based Rethink Priorities and funded by proponents of effective altruism, a philanthropic movement that has poured money into mitigating AI’s worst-case risks. Jordan Betterman (MLDS ’25) is a graduate assistant for Northwestern’s men’s soccer team and was responsible for gathering the data students used during the Hackathon.
The automated labels are not perfect, and a human labeler must review and adjust them, but they speed up the process significantly. In addition, the automated labeling system can be further trained and improved as it receives feedback from human labelers. Biased, mislabeled, inconsistent or incomplete data reduces the quality of ML models, which in turn harms the ROI of AI initiatives. Postdoctoral researcher Luciano Borasi created a unified method to study how materials behave across the full spectrum of deformation speeds. “This work overcomes those challenges,” said Krishnaswamy, director of the Center for Smart Structures and Materials and professor of mechanical engineering.
By 2030, AI-powered drug discovery is projected to be a $9.1 billion market, growing at a staggering 29.7% CAGR. AI promises to accelerate clinical trials, optimize supply chains and personalize patient treatments at scales previously unimaginable. But there are limits, and after further research, Epoch now foresees running out of public text data sometime in the next two to eight years.