Custom-Persona GPT Performance Metrics Used in Practice


September 03, 2025 · 2 min read

Scoring frameworks for evaluating the outputs of custom GPT personas

When a company manager uses a custom GPT without having defined guardrails or performance metrics to judge its outputs, outliers, hallucinations, and bias creep in.

The GPT then has a high probability of generating outputs that are not meaningful to base decisions on, and people are right to keep their distance. Does it sound familiar? Managers playing with GPTs on their phones only a few years ago, garbage in generating garbage out, passed on to the next level. Our view is to report such setups and steer clear of these providers.

Hence it is mandatory to define metrics up front.

Metrics for Evaluating a Custom GPT

1. Quantitative Metrics

Useful for consistent, scalable evaluation across many outputs.

 

Relevance / Accuracy

BLEU, ROUGE, METEOR, BERTScore (traditional NLP overlap metrics)

Exact match / F1 (for fact-based Q&A)
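
As a worked example, here is a minimal sketch of exact-match and token-level F1 scoring for fact-based Q&A, using only the standard library; the normalisation rules (lowercasing, punctuation stripping) are an assumption to adapt to your domain.

```python
import re
from collections import Counter

def normalise(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalised token sequences are identical, else 0.0."""
    return float(normalise(prediction) == normalise(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalise(prediction), normalise(reference)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("The capital of France is Paris",
               "Paris is the capital of France"))  # 1.0: same token multiset
```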

 

Faithfulness / Groundedness

Does the output stay true to the source or provided context?

Hallucination rate: % of unsupported claims.

Attributability metrics (citations matching context).
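
Hallucination rate is easy to compute once claims are labelled. The sketch below assumes each output has already been split into atomic claims and each claim judged supported or unsupported against the provided context (by human reviewers or an NLI-style verifier; both are assumptions, not part of this post).

```python
def hallucination_rate(claims: list[tuple[str, bool]]) -> float:
    """Fraction of claims not supported by the provided context."""
    if not claims:
        return 0.0
    unsupported = sum(1 for _, supported in claims if not supported)
    return unsupported / len(claims)

claims = [
    ("Revenue grew 12% in Q2", True),     # found in the source document
    ("The CEO resigned in July", False),  # not in the provided context
]
print(f"Hallucination rate: {hallucination_rate(claims):.0%}")  # 50%
```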

 

Completeness / Coverage

% of required fields covered (e.g., in structured tasks).

Compliance with instructions (yes/no scoring).
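
For structured tasks, field coverage can be scored mechanically. A minimal sketch, assuming the GPT returns JSON and you maintain a list of required keys (the field names here are illustrative):

```python
import json

REQUIRED_FIELDS = ["customer_id", "risk_rating", "justification"]  # illustrative

def coverage(output_json: str, required: list[str]) -> float:
    """Fraction of required fields present and non-empty in the output."""
    try:
        data = json.loads(output_json)
    except json.JSONDecodeError:
        return 0.0  # unparseable output counts as zero coverage
    if not isinstance(data, dict):
        return 0.0
    present = sum(1 for field in required if data.get(field) not in (None, ""))
    return present / len(required)

print(coverage('{"customer_id": "C-17", "risk_rating": "medium"}',
               REQUIRED_FIELDS))  # ~0.67: two of three fields covered
```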

 

Readability & Fluency

Language quality: grammar, spelling, clarity.

Automated readability scores (Flesch-Kincaid, Coleman-Liau).
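
A minimal sketch of the Flesch-Kincaid grade level; the syllable counter is a rough vowel-group heuristic (an assumption), and libraries such as textstat give more careful estimates.

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels; at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

print(round(fk_grade("The model answered clearly. The user approved."), 1))
# ~9.4 with this rough heuristic
```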

 

Toxicity / Safety

Score outputs with classifiers (e.g., Perspective API).
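
A hedged sketch of calling the Perspective API for a toxicity score; the endpoint and payload follow Google's public documentation, and API_KEY is a placeholder you must supply (quota and terms of service apply).

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: supply your own key
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity(text: str) -> float:
    """Return Perspective's TOXICITY summary score (0.0 benign, 1.0 toxic)."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```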

 

Consistency

Stability of responses to the same query.

Entropy of output distribution over time.
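
One simple consistency check: send the same prompt several times, normalise the answers, and measure the entropy of the answer distribution (0 bits means perfectly stable). A minimal sketch, where the collected answers are assumed to come from your own calls to the custom GPT:

```python
import math
from collections import Counter

def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (bits) of the normalised answer distribution."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# e.g. three runs of the same prompt against the custom GPT
answers = ["Approve", "approve", "Reject"]
print(f"{answer_entropy(answers):.2f} bits")  # ~0.92: noticeably unstable
```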

 

2. Human / Qualitative Metrics

Needed because automated metrics often miss business and user context.

  •  Correctness: Did the output answer the question fully & accurately?

  •  Usefulness: Was the answer actionable for the intended user?

  •  Conciseness: Was it clear and not verbose?

  •  Format adherence: Did it follow required format/schema?

  •  Domain-specific scoring: For example, in banking projects → compliance accuracy, risk assessment completeness.

Scoring usually uses a Likert scale (1–5) or weighted rubrics, as in the sketch below.
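
A minimal sketch of aggregating 1–5 Likert ratings from several reviewers per criterion; the criteria names and values are illustrative.

```python
from statistics import mean

# 1-5 ratings from three reviewers per criterion (illustrative values)
ratings = {
    "correctness": [5, 4, 5],
    "usefulness":  [4, 4, 3],
    "conciseness": [3, 4, 4],
}

for criterion, scores in ratings.items():
    print(f"{criterion}: {mean(scores):.1f} / 5")
```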

 

3. Long-Term Regression Monitoring

To avoid GPT quality drift over time:

  •  Baseline test set: Curated prompts + expected outputs.

  •  Periodic benchmarking: Run the same prompts after updates, compare scores.

  •  Regression threshold: Define acceptable variance (e.g., must stay within ±3% accuracy or completeness); a minimal check is sketched after this list.

  •  Error categorisation: Track whether errors are factual, format, or style regressions.

  •  User feedback loop: Live scoring from real users (thumbs up/down, NPS-style metrics).
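
A minimal sketch of the regression check described above: re-run the baseline set after each update and flag any metric that drifts beyond the agreed ±3% threshold. The baseline scores here are illustrative.

```python
BASELINE = {"accuracy": 0.91, "completeness": 0.88}  # stored reference scores (illustrative)
THRESHOLD = 0.03  # ±3% acceptable variance, per the post

def check_regression(current: dict[str, float]) -> list[str]:
    """Return the metrics whose scores moved more than the threshold."""
    return [metric for metric, base in BASELINE.items()
            if abs(current.get(metric, 0.0) - base) > THRESHOLD]

print(check_regression({"accuracy": 0.86, "completeness": 0.89}))
# ['accuracy'] -- accuracy dropped 5 points, completeness moved only 1
```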

 

Recommended Scoring Framework (Practical)

  •  Correctness (40%) → factual accuracy, relevance.

  •  Completeness (20%) → all requirements covered.

  •  Clarity (15%) → easy to read, no ambiguity.

  •  Faithfulness (15%) → grounded in given sources, no hallucination.

  •  Format/Compliance (10%) → matches requested format or structure.

 

Total = 100% score per output, benchmarked over time.
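
A minimal sketch of computing that weighted total, assuming each dimension has been scored 0–1 by whichever metric or rubric applies:

```python
WEIGHTS = {
    "correctness": 0.40,
    "completeness": 0.20,
    "clarity": 0.15,
    "faithfulness": 0.15,
    "format_compliance": 0.10,
}

def total_score(scores: dict[str, float]) -> float:
    """Weighted 0-100 score for one output; missing dimensions score 0."""
    return 100 * sum(WEIGHTS[d] * scores.get(d, 0.0) for d in WEIGHTS)

print(total_score({"correctness": 0.9, "completeness": 1.0,
                   "clarity": 0.8, "faithfulness": 1.0,
                   "format_compliance": 1.0}))  # 93.0
```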
