Trust Score Methodology
How we evaluate AI model performance with a patent-pending, multi-metric framework.
Patent Pending: The Trust Score evaluation framework is covered by a provisional patent. The metric names and conceptual descriptions below are public; the specific algorithms, weighting formulas, scoring calculations, and calibration methods are proprietary.
What Makes Trust Score Different
Unlike synthetic benchmarks that test AI models on standardized datasets, Trust Score evaluates models on real queries from real users on Search Umbrella. Every response is scored in real time by a separate evaluator model running asynchronously — the same way a human expert would review the output, but at scale.
The composite Trust Score (0-10) is derived from 7 individual metrics using a proprietary, patent-pending algorithm. Each metric captures a different dimension of response quality:
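The exact weighting formula is proprietary, but the general shape of the derivation can be sketched. The snippet below uses a plain weighted average with hypothetical equal weights purely for illustration; the metric abbreviations match the seven metrics listed below, while the weights and rounding are assumptions, not the patented algorithm.

```python
# Hypothetical metric weights for illustration only; the actual
# weighting formula and calibration methods are proprietary.
ILLUSTRATIVE_WEIGHTS = {
    "RC": 1.0,  # Readability / Clarity
    "FA": 1.0,  # Factual Accuracy
    "SC": 1.0,  # Semantic Consistency
    "RF": 1.0,  # Relevance / Focus
    "ST": 1.0,  # Style / Tone
    "ED": 1.0,  # Ensemble Disagreement
    "HL": 1.0,  # Human Likeness
}

def composite_trust_score(metrics: dict[str, float]) -> float:
    """Combine per-metric scores (each 0-10) into one 0-10 composite.

    A plain weighted average stands in for the proprietary algorithm.
    """
    total_weight = sum(ILLUSTRATIVE_WEIGHTS[m] for m in metrics)
    weighted = sum(ILLUSTRATIVE_WEIGHTS[m] * s for m, s in metrics.items())
    return round(weighted / total_weight, 2)

scores = {"RC": 8.5, "FA": 7.0, "SC": 9.0, "RF": 8.0,
          "ST": 8.5, "ED": 6.5, "HL": 7.5}
print(composite_trust_score(scores))  # with equal weights, a simple mean
```

With equal weights this reduces to the mean of the seven scores; the production system presumably weights metrics differently (e.g., Factual Accuracy more heavily) and applies calibration, neither of which is shown here.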
1 Readability / Clarity RC
How clear, well-structured, and easy to understand is the response? This metric evaluates logical organization, grammar, formatting, and whether the complexity level matches the query.
2 Factual Accuracy FA
Are the facts verifiable and correct? This metric checks claims against known information, identifies potential hallucinations, and evaluates the reliability of cited sources. This is the hardest metric for AI models — and our most discriminating score.
3 Semantic Consistency SC
Is the response internally consistent and logically coherent? This metric detects contradictions within the response and evaluates whether the reasoning follows logically from premise to conclusion.
4 Relevance / Focus RF
How closely does the response answer the actual query? This metric measures whether the AI stays on topic, addresses the core question, and avoids unnecessary tangents.
5 Style / Tone ST
Is the writing style appropriate for the context? A legal question should get a professional response; a creative writing request should get an imaginative one. This metric evaluates contextual appropriateness across domains.
6 Ensemble Disagreement ED
When multiple AI models answer the same question, do they agree? This metric — unique to multi-model platforms — measures cross-model consensus. High agreement increases confidence; significant disagreement flags potential issues or highlights where one model found something others missed.
7 Human Likeness HL
How natural, conversational, and human-like is the response? This metric evaluates whether the AI communicates in a way that feels authentic rather than robotic, formulaic, or overly corporate.
Data Collection
- Source: Real user queries on Search Umbrella — not synthetic benchmarks or curated test sets
- Evaluation: Every response is evaluated independently by a separate evaluator model running asynchronously
- Scale: 2,637 evaluations across 32 models and 8 domains
- Period: December 2025 – February 2026 (continuously updated)
- Multi-Model: 51.1% of queries were sent to multiple models simultaneously, enabling head-to-head comparison
- Domains: General, Business, Technical, Coding, Creative, Personal, Legal, Research
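To make the collection pipeline concrete, here is one plausible shape for a single evaluation record, assuming the fields implied by the bullets above. All field and class names are illustrative inventions, not Search Umbrella's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical record for one scored response; names are illustrative,
# not Search Umbrella's real data model.
@dataclass
class Evaluation:
    query_id: str
    model: str                 # one of the 32 evaluated models
    domain: str                # one of the 8 domains listed above
    timestamp: datetime        # falls in the collection period
    metrics: dict[str, float] = field(default_factory=dict)  # "RC".."HL" -> 0-10
    multi_model: bool = False  # True when the query went to several models

ev = Evaluation(
    query_id="q-001",
    model="example-model",
    domain="Technical",
    timestamp=datetime(2026, 1, 15),
    metrics={"RC": 8.0, "FA": 7.5},
    multi_model=True,
)
print(ev.domain)
```

Records flagged `multi_model=True` (51.1% of queries, per the stats above) are the ones that can feed the Ensemble Disagreement metric.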
Why Ensemble Disagreement Matters
Ensemble Disagreement is our most distinctive metric — and it's only possible on a multi-model platform like Search Umbrella. When a user sends the same query to multiple AI models, we can measure whether they agree. If 4 out of 5 models give the same answer but one disagrees, that disagreement is informative. It might indicate the outlier is wrong — or it might mean the outlier found something the others missed.
This cross-model consensus signal is something no single-model benchmark can capture. It's one of the core reasons we believe the Trust Score provides a more complete picture of AI model performance than traditional benchmarks.
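The "4 out of 5" intuition above can be sketched as a simple majority-consensus check. This is an illustrative stand-in, not the patented ED scoring: it treats each model's (normalized) answer as a label and reports the share of models outside the majority, plus which models to flag.

```python
from collections import Counter

def ensemble_disagreement(answers: list[str]) -> tuple[float, list[int]]:
    """Illustrative consensus measure, not the proprietary ED algorithm.

    Returns (disagreement in [0, 1], indices of outlier answers), where
    disagreement is the fraction of models outside the majority answer.
    """
    counts = Counter(answers)
    majority, majority_count = counts.most_common(1)[0]
    outliers = [i for i, a in enumerate(answers) if a != majority]
    return (len(answers) - majority_count) / len(answers), outliers

# 4 of 5 models agree; the fifth is flagged for review.
score, flagged = ensemble_disagreement(["42", "42", "42", "42", "41"])
print(score, flagged)  # 0.2 [4]
```

A production version would compare semantically (e.g., via embeddings) rather than by exact string match, and would decide separately whether a flagged outlier is an error or a genuine insight the other models missed.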
Frequently Asked Questions
What is Trust Score?
Trust Score is a patent-pending AI evaluation framework that scores AI model responses on a 0-10 scale across 7 metrics: readability, factual accuracy, semantic consistency, relevance, style, ensemble disagreement, and human likeness. Unlike synthetic benchmarks, Trust Score evaluates models on real user queries from Search Umbrella.
How is Trust Score different from other AI benchmarks?
Trust Score evaluates AI models on real queries from real users, not standardized test sets. It measures 7 dimensions of response quality and includes ensemble disagreement, a metric only possible on multi-model platforms that measures cross-model consensus on the same query.
What does the Readability / Clarity metric measure?
How clear, well-structured, and easy to understand is the response? This metric evaluates logical organization, grammar, formatting, and whether the complexity level matches the query.
What does the Factual Accuracy metric measure?
Are the facts verifiable and correct? This metric checks claims against known information, identifies potential hallucinations, and evaluates the reliability of cited sources. This is the hardest metric for AI models — and our most discriminating score.
What does the Semantic Consistency metric measure?
Is the response internally consistent and logically coherent? This metric detects contradictions within the response and evaluates whether the reasoning follows logically from premise to conclusion.
What does the Relevance / Focus metric measure?
How closely does the response answer the actual query? This metric measures whether the AI stays on topic, addresses the core question, and avoids unnecessary tangents.
What does the Style / Tone metric measure?
Is the writing style appropriate for the context? A legal question should get a professional response; a creative writing request should get an imaginative one. This metric evaluates contextual appropriateness across domains.
What does the Ensemble Disagreement metric measure?
When multiple AI models answer the same question, do they agree? This metric — unique to multi-model platforms — measures cross-model consensus. High agreement increases confidence; significant disagreement flags potential issues or highlights where one model found something others missed.
What does the Human Likeness metric measure?
How natural, conversational, and human-like is the response? This metric evaluates whether the AI communicates in a way that feels authentic rather than robotic, formulaic, or overly corporate.
How many AI models does Trust Score evaluate?
Trust Score currently evaluates 32 AI models across 8 domains, based on 2,637 real-world evaluations. Data is collected from real users on Search Umbrella and is continuously updated.
See Trust Scores in Action
Run your own multi-model comparison and see how Trust Score evaluates responses in real time.
Try Search Umbrella →