Performance Benchmark

Our SEAL pipeline scores higher than every large language model (LLM) in the comparison on all three financial NLP benchmarks, averaging 0.934 against 0.834 for the next-best model, OpenAI gpt-4o.

Performance Comparison

| Model | Financial PhraseBank | Earnings Calls & Analyst Reports | Headlines | Average |
|---|---|---|---|---|
| SEAL Pipeline (ZQ) | 0.980 | 0.957 | 0.865 | 0.934 |
| OpenAI gpt-4o | 0.928 | 0.750 | 0.824 | 0.834 |
| Claude 3.5 Sonnet | 0.944 | 0.692 | 0.827 | 0.821 |
| OpenAI o1-mini | 0.917 | 0.720 | 0.769 | 0.802 |
| Qwen 2 Instruct (72B) | 0.901 | 0.639 | 0.830 | 0.790 |
| DeepSeek R1 | 0.902 | 0.688 | 0.769 | 0.786 |
| Mixtral-8x7B Instruct | 0.893 | 0.583 | 0.805 | 0.760 |
| Google Gemini 1.5 Pro | 0.885 | 0.525 | 0.837 | 0.749 |
| Claude 3 Haiku | 0.908 | 0.558 | 0.781 | 0.749 |
| DeepSeek-V3 | 0.814 | 0.675 | 0.729 | 0.739 |
| Gemma 2 9B | 0.940 | 0.365 | 0.856 | 0.720 |
| Jamba 1.5 Large | 0.798 | 0.541 | 0.782 | 0.707 |
| Mixtral-8x22B Instruct | 0.776 | 0.513 | 0.835 | 0.708 |
| Llama 3 70B Instruct | 0.902 | 0.386 | 0.811 | 0.700 |
| Mistral (7B) Instruct v0.3 | 0.841 | 0.412 | 0.779 | 0.677 |
| Llama 3 8B Instruct | 0.698 | 0.511 | 0.763 | 0.657 |
| DeepSeek LLM (67B) | 0.811 | 0.151 | 0.778 | 0.580 |
| Cohere Command R 7B | 0.840 | 0.068 | 0.770 | 0.559 |
| QwQ-32B-Preview | 0.815 | 0.020 | 0.744 | 0.526 |
| DBRX Instruct | 0.499 | 0.319 | 0.746 | 0.521 |
| Jamba 1.5 Mini | 0.765 | 0.151 | 0.682 | 0.533 |
| Cohere Command R + | 0.699 | 0.118 | 0.812 | 0.543 |
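
The Average column is not defined on this page, but the reported values are consistent with an unweighted mean of the three benchmark scores. The sketch below recomputes it for a few rows copied from the table; the function names (`mean_score`, `check_rows`) and the rounding tolerance are illustrative assumptions, not part of the SEAL pipeline.

```python
# Rows copied from the comparison table:
# (Financial PhraseBank, Earnings Calls & Analyst Reports, Headlines, reported Average)
ROWS = {
    "SEAL Pipeline (ZQ)": (0.980, 0.957, 0.865, 0.934),
    "OpenAI gpt-4o": (0.928, 0.750, 0.824, 0.834),
    "Claude 3.5 Sonnet": (0.944, 0.692, 0.827, 0.821),
    "Llama 3 70B Instruct": (0.902, 0.386, 0.811, 0.700),
}

# Reported averages are rounded to three decimals, so allow half a unit
# in the last place plus a small margin for floating-point error.
TOLERANCE = 0.0005 + 1e-9


def mean_score(scores):
    """Unweighted mean of the per-benchmark scores (assumed definition)."""
    return sum(scores) / len(scores)


def check_rows(rows, tolerance=TOLERANCE):
    """Map each model to (recomputed mean, reported average, whether they agree)."""
    result = {}
    for model, (*scores, reported) in rows.items():
        recomputed = mean_score(scores)
        result[model] = (recomputed, reported, abs(recomputed - reported) <= tolerance)
    return result


for model, (recomputed, reported, ok) in check_rows(ROWS).items():
    print(f"{model}: recomputed={recomputed:.4f} reported={reported:.3f} match={ok}")
```

Because the mean is unweighted, a very low score on one benchmark (for example, the Earnings Calls & Analyst Reports column) pulls a model's average down as much as a high score on another raises it.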

ZQ_CLASSIFY Performance

We have also developed a separate classification system, ZQ_CLASSIFY, which scores higher than both SNOWFLAKE_AI_CLASSIFY and DATABRICKS_AI_CLASSIFY on all four benchmark datasets:

Comparative Results (averaged across 3 seeds)

| Dataset | ZQ_CLASSIFY | SNOWFLAKE_AI_CLASSIFY | DATABRICKS_AI_CLASSIFY |
|---|---|---|---|
| Headlines | 0.865 | 0.533 | 0.524 |
| Congressional Committee Hearings | 0.660 | 0.620 | 0.406 |
| Financial PhraseBank | 0.980 | 0.928 | 0.904 |
| Earnings Calls & Analyst Reports | 0.957 | 0.915 | 0.753 |

Datasets

A. Financial PhraseBank — Sentiment classification dataset of ~4.8k annotated financial sentences from news. Malo et al., JASIST, 2014

B. Claim Detection (Earnings Calls & Analyst Reports) — Numerical claim detection benchmark on analyst reports & earnings calls. ACL Anthology 2024.fever-1.21

C. News Headlines (Gold News Dataset) — Financial news headlines dataset introduced by Ankur Sinha & Tanmay Khandait (2020), used to extract multiple semantic dimensions (e.g. price direction, temporal reference, asset comparison) from gold-related news headlines. arXiv:2009.04202

D. CoCoHD: Congress Committee Hearing Dataset — Dataset for analyzing congressional committee hearings with price increase/decrease classifications. Arnav Hiray, Yunsong Liu, Mingxiao Song, Agam Shah, and Sudheer Chava, arXiv:2410.03099