Beyond the internet: AI learning from 15th-century texts

Agencies
Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s: a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but still just a drop in the bucket compared with what's being fed into the most advanced AI systems.
Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th centuries by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, "Hey, we're a public library,"' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works.
The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th-century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab who said the initiative is trying to provide guidance about mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'

Meta offered $100 mn bonuses to poach OpenAI employees

Qatar Tribune

12 hours ago

Agencies
Meta offered $100 million bonuses to OpenAI employees in an unsuccessful bid to poach the ChatGPT maker's talent and strengthen its own generative AI teams, the startup's CEO, Sam Altman, has said.
Facebook's parent company, a competitor of OpenAI, also offered 'giant' annual salaries exceeding $100 million to OpenAI staffers, Altman said in an interview on the 'Uncapped with Jack Altman' podcast released Tuesday.
'It is crazy,' Altman told his brother Jack in the interview. 'I'm really happy that at least so far none of our best people have decided to take them up on that.'
The OpenAI cofounder said Meta had made the offers to 'a lot of people on our team.' Meta did not immediately respond to a request for comment.
The social media titan has invested billions of dollars in artificial intelligence technology amid fierce competition in the AI race with rivals OpenAI, Google and Microsoft. Meta chief executive Mark Zuckerberg said in January that the firm planned to invest at least $60 billion in AI this year, with ambitions to lead in the technology.
Last week, Meta entered into a deal reportedly worth more than $10 billion with Scale AI, a company specializing in labeling data used in training artificial intelligence models. As part of the deal, company founder and CEO Alexandr Wang will join Meta to help with the tech giant's AI ambitions, including its work on superintelligence efforts.
Comparing Meta to his company, Altman said on the podcast that 'OpenAI has a much better shot at delivering on superintelligence.'
'I think the strategy of a ton of upfront guaranteed comp and that being the reason you tell someone to join... I don't think that's going to set up a great culture,' the OpenAI boss added.

China's AI expansion adds to existing employment crisis

Qatar Tribune

12 hours ago

Agencies
New York
China's rapid AI adoption is reshaping its labour market, with analysts debating whether it will alleviate workforce shortages or deepen deflationary pressures. Morgan Stanley's latest report warns that AI-driven automation could exacerbate China's employment challenges, particularly given its high youth unemployment rate (over 15 percent) and prolonged deflationary trends. AI's ability to replace junior-level cognitive tasks may encourage companies to invest more in technology while reducing hiring, potentially leading to slower wage growth and economic stagnation.
However, Ding Shuang, chief Greater China economist at Standard Chartered, argues that AI is not the primary driver of deflation. Instead, overcapacity in key industries and weak domestic demand are the main culprits.
Despite concerns, AI's economic potential remains significant. The International Monetary Fund estimates that 40 percent of jobs in emerging markets are exposed to AI, with 16 percent complemented by automation and 24 percent fully replaced. To mitigate disruptions, policymakers should enhance social protections, expand AI-focused education, and encourage job creation in sectors less susceptible to automation. Ultimately, AI's impact on China's economy will depend on balanced policy measures and strategic workforce adaptation.
China's rapid AI adoption is reshaping its labour market, sparking debate over whether it will alleviate workforce shortages or deepen deflationary pressures. While AI-driven automation offers long-term productivity gains, concerns persist about its immediate impact on employment and wage growth. China has been battling prolonged deflationary risks, with its producer price index contracting for 30 consecutive months, including a 2.7 percent drop in April. The employment market remains sluggish, with urban youth unemployment exceeding 15 percent.
AI could generate a labour equivalent value of 6.7 trillion yuan ($931 billion), approximately 5 percent of China's nominal GDP. To mitigate disruptions, policymakers should enhance social protections, expand AI-focused education, and encourage job creation in sectors less susceptible to automation. While AI may eliminate some roles, it also fosters new employment opportunities. The future of China's labour market will depend on balanced policy measures and strategic workforce adaptation.
Despite optimism surrounding AI's potential, its widespread adoption may not significantly enhance China's productivity. While automation can streamline operations, it does not directly address deeper structural issues such as declining birth rates, shrinking workforce, and weak domestic demand. In 2023, China recorded only 9 million births, the lowest since 1949, with the population dropping for the second consecutive year to 1.4 billion, a decline of over 2 million.
AI's displacement of junior-level cognitive jobs could worsen China's employment crisis, particularly as urban youth unemployment exceeds 15 percent. While AI may create new roles, the transition requires extensive reskilling, which China's current education system struggles to accommodate. Additionally, overinvestment in AI could exacerbate economic stagnation, as companies prioritize automation over workforce expansion. To mitigate these risks, policymakers must strengthen social protections, invest in AI-focused education, and promote job creation in sectors less vulnerable to automation. Without balanced policy measures, AI's rapid integration may deepen economic instability rather than drive sustainable growth.
China's deflationary pressures stem from multiple economic factors, with AI playing only a minor role in the broader downturn. The country's weak property market and sluggish domestic demand have been the primary drivers of deflation, overshadowing the impact of AI adoption.
Despite concerns about automation displacing workers, recent data suggests that AI's effect on employment remains limited. A UBS survey conducted between March 18 and April 17 among 404 senior executives in China revealed that 73 percent of respondents had integrated AI technologies into their businesses, primarily to enhance efficiency and improve product quality. While 15 percent reported workforce reductions, 22 percent stated they had created new positions due to AI adoption. Additionally, 46 percent of executives noted salary increases linked to productivity gains, whereas 18 percent reported wage cuts for certain employees. These findings indicate that AI has not yet caused widespread disruption in China's labour market or household income. However, as AI adoption accelerates, policymakers must ensure balanced economic strategies that address employment shifts, wage stability, and broader economic challenges. The future of AI's role in China's economy will depend on how businesses and regulators navigate technological advancements alongside existing structural issues.

Al Jazeera

a day ago

Can ChatGPT be your therapist?

AI chatbots can reduce anxiety and depression, according to recent research. As chatbot therapy goes mainstream, can it replace a real therapeutic relationship?