
AI chatbots need more books to learn from
CAMBRIDGE, Mass — Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but still just a drop in the bucket compared with what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated US$50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th centuries by Canadian immigrant communities from Quebec. Now that such text is useful as training data, that demand helps bankroll projects that librarians want to do anyway.
'We've been very clear that, 'Hey, we're a public library,'' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute finally ended in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th-century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab who said the initiative is trying to provide guidance about mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'
————
The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to part of AP's text archives.
Matt O'Brien, The Associated Press