
AI chatbots need more books to learn from. These libraries are opening their stacks
By MATT O'BRIEN
Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but still just a small fraction of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th century by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, "Hey, we're a public library,"' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab who said the initiative is trying to provide guidance about mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'
© Copyright 2025 The Associated Press. All rights reserved. This material may not be published, broadcast, rewritten or redistributed without permission.