Large language models (LLMs) such as OpenAI’s GPT often err when answering questions drawn from Securities and Exchange Commission (SEC) filings, according to a study by researchers at the startup Patronus AI. In their tests, the best-performing model, OpenAI’s GPT-4 Turbo, achieved an accuracy rate of only 79%. That level of performance is “absolutely unacceptable” for production applications, according to Patronus AI co-founder Anand Kannappan.
Developing AI in Regulated Industries: A Difficult Landscape
The findings highlight the difficulties AI models face in regulated fields like finance, where accuracy and dependability are crucial. In the tests, LLMs frequently made mistakes, citing inaccurate figures or flatly declining to answer questions. The results matter because the ability to accurately summarize and analyze SEC filings could give financial firms a meaningful competitive advantage.
FinanceBench: A New Benchmark for AI Performance in the Financial Sector
The FinanceBench dataset, created by Patronus AI, comprises more than 10,000 questions and answers drawn from SEC filings. Its goal is to establish a standard for evaluating AI’s effectiveness in the financial sector. The dataset adds complexity and rigor by requiring not only text extraction but also a degree of mathematical reasoning.
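The core mechanic of a benchmark like this — comparing a model’s answers against reference answers drawn from filings and reporting an accuracy rate — can be sketched in a few lines. This is a minimal illustration only, not Patronus AI’s actual evaluation code; the question/answer record format, the `model_fn` callable, and the exact-match scoring rule are all assumptions for the sake of the example.

```python
# Minimal sketch of benchmark-style accuracy scoring.
# The record schema and normalization rule are assumptions,
# not the actual FinanceBench format.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences are not counted as errors."""
    return " ".join(text.lower().split())

def score(examples, model_fn):
    """Return the fraction of questions answered correctly,
    using exact match after normalization."""
    correct = 0
    for ex in examples:
        prediction = model_fn(ex["question"], ex["context"])
        if normalize(prediction) == normalize(ex["answer"]):
            correct += 1
    return correct / len(examples)

# Toy data and a stand-in "model" that returns canned answers.
sample = [
    {"question": "What was 2022 revenue?",
     "context": "Revenue was $394.3B.", "answer": "$394.3B"},
    {"question": "What was net income?",
     "context": "Net income was $99.8B.", "answer": "$99.8B"},
]
canned = {"What was 2022 revenue?": "$394.3B",
          "What was net income?": "unknown"}

accuracy = score(sample, lambda q, ctx: canned[q])
print(f"accuracy: {accuracy:.0%}")  # one of the two answers matches
```

In practice, real evaluations use more forgiving scoring (numeric tolerance, semantic matching, or human review), since exact string matching would unfairly penalize answers like “394.3 billion dollars.”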
Providing Trustworthy AI Applications in Industry
Patronus AI’s main objective is to provide software tools for automated testing of large language models (LLMs), so that AI applications in business do not spread false information. With growing reliance on AI across numerous industries, guaranteeing the accuracy and dependability of these technologies is more important than ever.