Popular AIs head-to-head: OpenAI beats DeepSeek on sentence-level reasoning


Yahoo · 17-04-2025

ChatGPT and other AI chatbots based on large language models are known to occasionally make things up, including scientific and legal citations. It turns out that measuring how accurate an AI model's citations are is a good way of assessing the model's reasoning abilities.
An AI model 'reasons' by breaking down a query into steps and working through them in order. Think of how you learned to solve math word problems in school.
Ideally, to generate citations an AI model would understand the key concepts in a document, generate a ranked list of relevant papers to cite, and provide convincing reasoning for how each suggested paper supports the corresponding text. It would highlight specific connections between the text and the cited research, clarifying why each source matters.
The question is, can today's models be trusted to make these connections and provide clear reasoning that justifies their source choices? The answer goes beyond citation accuracy to address how useful and accurate large language models are for any information retrieval purpose.
I'm a computer scientist. My colleagues (researchers from the AI Institute at the University of South Carolina, Ohio State University and the University of Maryland, Baltimore County) and I have developed the Reasons benchmark to test how well large language models can automatically generate research citations and provide understandable reasoning.
We used the benchmark to compare the performance of two popular AI reasoning models, DeepSeek's R1 and OpenAI's o1. Though DeepSeek made headlines with its stunning efficiency and cost-effectiveness, the Chinese upstart has a way to go to match OpenAI's reasoning performance.
The accuracy of citations has a lot to do with whether the AI model is reasoning about information at the sentence level rather than paragraph or document level. Paragraph-level and document-level citations can be thought of as throwing a large chunk of information into a large language model and asking it to provide many citations.
In this process, the large language model overgeneralizes and misinterprets individual sentences. The user ends up with citations that explain the whole paragraph or document, not the relatively fine-grained information in the sentence.
Further, reasoning suffers when you ask a large language model to read through an entire document. These models mostly rely on memorized patterns, which they are typically better at finding at the beginning and end of longer texts than in the middle. This makes it difficult for them to fully understand all the important information throughout a long document.
Large language models get confused because paragraphs and documents hold a lot of information, which affects citation generation and the reasoning process. Consequently, reasoning from large language models over paragraphs and documents becomes more like summarizing or paraphrasing.
The Reasons benchmark addresses this weakness by examining large language models' citation generation and reasoning.
Following the release of DeepSeek R1 in January 2025, we wanted to examine its accuracy in generating citations and its quality of reasoning and compare it with OpenAI's o1 model. We created a paragraph that had sentences from different sources, gave the models individual sentences from this paragraph, and asked for citations and reasoning.
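The per-sentence setup can be sketched as follows. This is an illustrative outline only: `ask_model`, the sentences and the returned citation are hypothetical placeholders, not the benchmark's actual code or data.

```python
# Illustrative sketch of a per-sentence citation protocol.
# ask_model is a hypothetical stand-in for a call to DeepSeek R1 or
# OpenAI o1; the sentences and citation string are invented.

def ask_model(sentence: str) -> dict:
    # A real implementation would prompt the LLM to return a citation
    # plus a short justification for this one sentence.
    return {"citation": "Doe et al. (2021)", "reasoning": "stub"}

# A composite paragraph whose sentences come from different sources.
paragraph = [
    "Neurons encode information partly through spike timing.",
    "Eye tracking reveals how users scan an interface.",
]

# Query the model once per sentence, not once per paragraph,
# so each citation is judged against fine-grained content.
results = []
for sentence in paragraph:
    answer = ask_model(sentence)
    results.append((sentence, answer["citation"], answer["reasoning"]))
```

The key point is the loop granularity: each query carries a single sentence, so the model cannot fall back on summarizing the whole paragraph.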
To start our test, we developed a small test bed of about 4,100 research articles around four key topics related to human brains and computer science: neurons and cognition, human-computer interaction, databases and artificial intelligence. We evaluated the models using two measures: F1 score, which measures how accurate the provided citations are, and hallucination rate, which measures how sound the model's reasoning is, that is, how often it produces an inaccurate or misleading response.
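Under reasonable assumptions about how such measures are defined (F1 as the harmonic mean of precision and recall over predicted vs. gold citation sets, and hallucination rate as the fraction of responses judged inaccurate), they can be computed like this; the paper IDs below are invented for illustration.

```python
def f1_score(predicted, gold):
    """F1 over citation sets: harmonic mean of precision and recall."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)        # predicted citations in the gold set
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def hallucination_rate(judgments):
    """Fraction of responses judged inaccurate or misleading (1 = bad)."""
    return sum(judgments) / len(judgments)

# Example: the model cites 3 papers, 2 of which appear in a gold set of 4.
score = f1_score({"p1", "p2", "p9"}, {"p1", "p2", "p3", "p4"})
print(round(score, 2))  # → 0.57
```

Note the asymmetry: F1 rewards correct citations, while hallucination rate penalizes confident but wrong reasoning even when some citation is produced.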
Our testing revealed significant performance differences between OpenAI o1 and DeepSeek R1 across different scientific domains. OpenAI's o1 did well connecting information between different subjects, such as understanding how research on neurons and cognition connects to human-computer interaction and then to concepts in artificial intelligence, while remaining accurate. Its performance metrics consistently outpaced DeepSeek R1's across all evaluation categories, especially in reducing hallucinations and successfully completing assigned tasks.
OpenAI o1 was better at combining ideas semantically, whereas R1 focused on making sure it generated a response for every attribution task, which in turn increased hallucination during reasoning. OpenAI o1 had a hallucination rate of approximately 35% compared with DeepSeek R1's rate of nearly 85% in the attribution-based reasoning task.
In terms of accuracy and linguistic competence, OpenAI o1 scored about 0.65 on the F1 measure, which means it was right about 65% of the time when answering questions. It also scored about 0.70 on the BLEU test, which measures how well a language model writes in natural language. These are pretty good scores.
DeepSeek R1 scored lower, with about 0.35 on the F1 measure, meaning it was right about 35% of the time. However, its BLEU score was only about 0.2, which means its writing wasn't as natural-sounding as o1's. This shows that o1 was better at presenting its information in clear, natural language.
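For readers unfamiliar with BLEU: it is an n-gram-overlap metric, scoring how closely a generated sentence matches a reference text rather than measuring fluency directly. A minimal self-contained sketch of sentence-level BLEU, without the smoothing that real toolkits such as NLTK apply, looks like this:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty. No smoothing,
    so any zero precision collapses the score to 0."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean
```

A perfect match scores 1.0, so a reported BLEU of about 0.2 means the generated text shared relatively few word sequences with the reference phrasing.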
On other benchmarks, DeepSeek R1 performs on par with OpenAI o1 on math, coding and scientific reasoning tasks. But the substantial difference on our benchmark suggests that o1 provides more reliable information, while R1 struggles with factual consistency.
Though we included other models in our comprehensive testing, the performance gap between o1 and R1 specifically highlights the current competitive landscape in AI development, with OpenAI's offering maintaining a significant advantage in reasoning and knowledge integration capabilities.
These results suggest that OpenAI still has a leg up when it comes to source attribution and reasoning, possibly due to the nature and volume of the data it was trained on. The company recently announced its deep research tool, which can create reports with citations, ask follow-up questions and provide reasoning for the generated response.
The jury is still out on the tool's value for researchers, but the caveat remains for everyone: Double-check all citations an AI gives you.
This article is republished from The Conversation, a nonprofit, independent news organization bringing you facts and trustworthy analysis to help you make sense of our complex world. It was written by: Manas Gaur, University of Maryland, Baltimore County
Read more:
Why building big AIs costs billions – and how Chinese startup DeepSeek dramatically changed the calculus
What is an AI agent? A computer scientist explains the next wave of artificial intelligence tools
AI pioneers want bots to replace human teachers – here's why that's unlikely
Manas Gaur receives funding from USISTEF Endowment Fund.


Related Articles

Perplexity's new AI features are a game changer. Here's how to make the most of them

Fast Company · 22 minutes ago

Perplexity has become my primary tool for search. I rely on it for concise summaries of complex topics. I like the way it synthesizes information and provides reliable citations for me to explore further. I prefer Perplexity's well-organized responses to Google's laundry list of links, though I still use Google to find specific sites & addresses and for other 'micro-searches.'

Perplexity's not perfect. I've rarely seen it hallucinate, but it can pick dubious sources or misinterpret your question. As with any tool that uses AI, the wording of your query impacts your result. Write detailed queries and specify preferred sources when you can. Double-check critical data or facts. Google's new AI Mode is a strong new competitor, and ChatGPT, Claude and others now offer AI-powered search, but I still rely on Perplexity for reasons detailed below. This post updates my previous post with new features, examples, and tips.

My favorite new features

Labs. Create slides, reports, dashboards, and Web apps by writing a detailed query and specifying the format of the results you want. Check out the Project Gallery for 20 examples.

Voice Mode. I ask historical questions about books, curiosities about nature and science, and things I should already know about movies & music. The transcript shows up afterwards.

Templates for Spaces. A large new collection of templates makes it easier to get started with custom instructions for various kinds of research, for sales/marketing, education, finance, or other subjects.

Transcription. Upload & transcribe files up to 25MB. Ask for insights & ideas.

Topical landing pages for finance, travel, shopping, and academics provide useful examples and new practical ways to use Perplexity.

When to use Perplexity

Get up to speed on a topic: Need to research North Korea-China relations? Ask Perplexity for a summary and sources. See the result.
Research hyper-specific information: Ask for a list of organizations that crowdsource info about natural disasters. See the result.

Explore personal curiosities: I was curious about Mozart's development as a violinist, so I asked for key dates and details. See the result.

The best things about Perplexity

Sources. Perplexity provides links to its sources, so you can follow up on anything you want to learn more about. Tip: specify sources to prioritize.

Summaries. Instead of long articles or lists of links, get straight-to-the-point answers that save time. Tip: specify when you want a summary table.

Follow-ups. Ask follow-up questions to dive deeper into a topic, just like a conversation. For visual topics, Perplexity can surface relevant images and videos. Tip: customize your own follow-up query if defaults aren't relevant.

Deep Research. Get fuller results for queries where you need more info. Tip: use Claude or ChatGPT to help you draft clearer, more thorough search prompts.

Spaces. Group related searches into collections so they're easy to return to later. I created one for Atlanta before a trip. You can keep a collection private, invite others to edit it, or share a public link. Tip: create a team space.

Pages. Share search results by creating public pages you can customize. Watch a 1-minute video demo. Examples: Beginners Guide to Drumming, a Barcelona itinerary, and forest hotels in Sweden.

Use Perplexity More Effectively

You can use Perplexity on the Web, Mac, Windows, iOS and Android. Start with Perplexity's own introductory guide, check the how-it-works FAQ, then use the Get Started template to use Perplexity itself to learn more.

Write detailed queries. Include two or more sentences specifying what you're looking for and why. Your result will be better than if you just use keywords.

Refine your settings. Specify one or more preferred source types: Web, academic sources, social (i.e., Reddit), or financial (SEC filings). Pick your model.
Advanced users can specify the AI model Perplexity uses. I'd recommend keeping Perplexity's default, or the o3 option for research that requires complex reasoning. You can also use Grok, Gemini or Claude.

Specify domains to search. Mention specific domains or kinds of sites you're interested in for more targeted results. Use a domain limiter to narrow your search to a particular site or domain type, e.g. 'domain:.gov' to focus only on government sites. Or just use natural language to limit Perplexity to certain kinds of sites, as in this example scouring CUNY sites for AI policies.

Personalize your account. Add a brief summary of your interests, focus areas, and information preferences in your profile to customize the way Perplexity provides you with answers.

Quick searches are fine when you're just looking for a simple fact, like when CUNY was founded. Pro searches are best for more intricate, multi-part queries. On the free plan you get 3 pro searches a day.

Examples: Perplexity in action

Check public opinion: 'Is there a Pew survey about discovering news through social media platforms?' See the result.

Explore historical archives: 'List literacy and education programs in high-growth African countries in the last decade.' See the result.

Pricing

Free for unlimited quick searches, 3 pro searches and 3 file uploads per day. $20/month for unlimited file and image uploads for analysis; access to Labs; and 10x as many citations. See the 2025 feature comparison.

Privacy

To protect your privacy when using Perplexity, capitalize on the following:

Turn 'data retention' off in your settings. (Screenshot.)

Turn on the Incognito setting if you're signed in to anonymize a search.

Search in an incognito browser tab without logging into Perplexity.

Bonus features

The free Chrome Extension lets you summon a Perplexity search from any page. The 'summarize' button hasn't always worked for me.
The Perplexity Encyclopedia has a collection of tool comparisons. An experimental beta Tasks feature lets you schedule customized searches. Listen to an AI audio chat about Perplexity I generated with NotebookLM.

Caveats

Accuracy and confabulation: While Perplexity uses retrieval-augmented generation to reduce errors, it's not flawless. Check the sources it references.

Document analysis limitations: The file size limit for uploads is 25MB. Convert larger files to text or use Adobe's free compressor or SmallPDF.

Deep Research, though fast, is not nearly as thorough as ChatGPT's Deep Research or Gemini's.

Alternatives to Perplexity

Google AI Mode: Google's much-improved new AI search option provides summary responses like Perplexity. Here's an example of a comparison table it created for me and its take on 10 Perplexity features. Try it in Labs. Free.

Consensus: Superb for academic queries. Search 200 million peer-reviewed research papers and get a summary and links to publications. Useful for scientific or other research questions, e.g. active vs. passive learning or how cash transfers impact poverty. Pricing: Free for unlimited searches and limited premium use; $9/month billed annually for full AI capabilities.

ChatGPT Web Search: Turn on the 'Search the Web' option under the tools menu when using ChatGPT to enable Web searching. Search chats include inline links with sources. For example, here's a ChatGPT Web search query about Perplexity vs. other AI search tools. It includes a helpful ChatGPT-generated chart.

As differentiators, I like Perplexity's summaries, suggested follow-up queries, Labs, and the handy Voice Mode for quick questions.

I'm a retiree living in Mexico who owns a BYD and a Tesla. Here's why I prefer the Chinese car.

Business Insider · 30 minutes ago

This as-told-to essay is based on a conversation with John Romer, a retired radiologist from Huntsville, Alabama, about his BYD Song Plus hybrid and Tesla Model 3. It has been edited for length and clarity.

I'm a 73-year-old former radiologist. I retired from practice in Huntsville, Alabama, nine years ago, after working there for 35 years. I retired to Florida, but after a while, my wife and I decided to move back to Alabama, where we bought a small home to be close to our children and grandchildren. But I'm pretty much living in Mexico most of the time now. I've been a resident for eight years, and I spend about nine months of the year in the country, going back to Alabama around four times a year to see family.

Chinese cars have become extremely popular in Mexico, so when the time came to buy a new car, BYD was the logical choice. They've come in with a strong presence and have a number of dealerships scattered throughout the country. I test-drove a BYD Song Plus hybrid SUV back in early October and really liked it. I bought one for around 777,000 pesos ($41,000), and it came in early November. I've had it for six or seven months, and it hasn't had a single issue.

It's extremely efficient — I'm probably getting over 40 miles per gallon — and having a hybrid gives me the flexibility of being able to go wherever I want. I would have gone with a pure EV, but Mexico doesn't have the charging infrastructure yet that they have in Europe or the US, so I was more comfortable with the hybrid here.

The Song is a very comfortable ride, and it also has a number of safety features that I like. It's very large, which can make it a bit difficult to drive, but it does have blind-spot warning and automatic front and rear braking if there's traffic that you haven't seen in front of you or behind you.

Tesla vs. BYD

Back home, I have a Tesla Model 3, which I bought two years ago for around the same price as I paid for the BYD Song Plus.
I use it when I go home to Alabama, but for now, it sits in the garage most of the time. I'm probably going to give it to my grandson, who's turning 16 next year. Comparing a hybrid to a fully electric car is like comparing apples and oranges, but I do prefer the BYD to the Tesla.

The Tesla has been a good car, but it is a little bit troublesome. When I'm in Alabama and I drive to see my daughter in Kentucky, I have to stop to recharge it. It's not a huge deal, but it's a little awkward. I'm almost sure that if I had a hybrid, I wouldn't have to stop.

I also don't care for Tesla's 'Autopilot' mode at all. I update it regularly, but it has yet to perform the way I think it should. For instance, in my experience, it cannot handle traffic circles, and it struggles around construction work. [Tesla did not immediately respond to a request for comment from Business Insider.] I don't think it's ready for primetime yet. The BYD has a similar setup, which I'm yet to use. The infrastructure in Mexico is not good, and there's a lot of construction around my area, so I'm a bit wary about using it here.

The Tesla is also not as well put together, in terms of the finish, as the BYD Song Plus. The BYD feels very solid, the interior is very well upholstered, and in my opinion, it has a better quality of construction than the Tesla. I'm a real tech person, and the BYD has all kinds of advanced technology.
As well as a heads-up display and automatic lights, it's got a nice 3D model of the car on the display — Tesla has one too, but it's not as accurate as the BYD one, which is really helpful when parking and maneuvering the car.

US drivers miss out

BYDs are everywhere in Mexico, and their prices are very competitive. I'm planning to buy my wife the BYD electric Dolphin Mini. We were going to get a golf cart to use around the community where we live, but the golf cart costs about $13,000, and you have to spend about $1,500 every two years replacing batteries. The BYD Dolphin Mini is $21,000 and has an eight-year battery guarantee, so it gives you a lot more flexibility for only $3,000 to $4,000 more.

I don't agree with the tariffs on Chinese EVs in the US. I've always believed in free trade. I'm very disappointed in the tariffs that Trump is imposing. I think competition is always good; it encourages the development of new technology and better-quality products. I do think that if you can't compete in an industry, then you need to find another industry you can compete in and let things get sorted out, rather than trying to artificially encourage industries with tariffs, which only drives prices up.

I let AI summarize every PDF I read — 6 prompts that saved me hours

Tom's Guide · an hour ago

I have had to read so many PDFs for work. This can end up taking hours of your day, searching through documents for nuggets of information or one key figure buried near the end. However, I have now been using ChatGPT for a while to help me with this. The chatbot can become your best friend when it comes to PDFs, working through a wealth of information to give you the answers you need. These six prompts are all you'll ever need to use with ChatGPT for your next PDF scroll.

Using the prompt 'Summarize this PDF' is a lifesaver for long documents. Instead of having to work your way through 30-odd pages of text, this prompt will offer up a summary of all of the key details. This will give a quick overview, as well as picking out key bits of information. For example, on research papers, this prompt can lay out the findings, methodology and key bits of information in one easy list. This also works for non-PDF documents. Try uploading a YouTube video or news story and using the same prompt to condense large amounts of information into an easy-to-digest form.

Sometimes you'll be working through so many PDFs in one go, quickly skimming through each one to just find a few key points. Skip this step by asking ChatGPT to 'Pick out the key points in this PDF'. Similar to the prompt above, this will work through the document, offering some bullet points on the most important parts of the document. This will often include facts and figures, findings or an overall objective from the PDF. This can also be expanded to ask the same question about specific chapters of a noticeably long PDF.

The next prompt is especially useful for PDFs that feature a lot of quotes. Whether it's for a research paper you're working on or you're trying to find quotes to support a marketing plan, this can scan through a whole PDF looking for quotes.
More importantly, if there is a ridiculous amount of text to work through, you can use this prompt more specifically. For example, 'Find quotes that are positive about the product from people in senior positions'. You can also ask ChatGPT where these quotes come from so you can go back and confirm all of the wording and attribution is correct.

The next prompt can be a little hit and miss. 'Extract all figures, tables and charts, explaining each' will do as it says. However, this is reliant on all of these tables and figures being readable to ChatGPT. If there is a table that has been photocopied or is in a particularly confusing infographic, it could be easily missed. However, in my time using this prompt, I have rarely seen it make a mistake. You can also ask ChatGPT to take the information and put it into tables of your own, or compile all of the information gathered into one document.

Why limit yourself to just the PDF you're reading? Ask ChatGPT to examine the internet for supporting articles, and it will provide a document of supporting content. Along with condensing all of this information into the chat, it will also provide links to important sources to follow up on. It is crucial to double-check any information that comes from this. Because you are asking ChatGPT to directly check the internet and compare it to a PDF, it can cause some hallucinations in the crossover.

A lot of the prompts above will extract information from a PDF, but they will do so in bullet points and summarized phrases. This one, on the other hand, will create a written report explaining everything that is in there. Simply ask ChatGPT to 'explain the PDF to me.' You can add extra instructions, such as explaining it in simple terms or explaining it for someone focused on marketing to pick up key details. I have found a lot of success with the prompt 'Read through the PDF and become an expert in it. Then, explain it to me as an expert on the information.'
This takes away the conversational tone and focuses on the importance of the information.
