In today’s AI landscape, ensuring the quality, safety, and reliability of AI outputs isn’t just a technical requirement—it’s a business imperative. Picept’s evaluators go beyond simple checks, offering a sophisticated system that helps you build trust in your AI applications while maintaining compliance and quality standards.

One of the most powerful features across all our evaluators is the flexible judge model selection. You can choose from over 100 different models, ranging from lightweight options for rapid testing to sophisticated models for nuanced evaluation. Even better, you can use different judge models for different evaluator types in the same API call, optimizing for both performance and accuracy.
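
For instance, a single payload can mix a fast judge for a structural check with a stronger judge for nuanced analysis. Here's a minimal sketch that mirrors the per-evaluator config shape used throughout this guide (the surrounding API call is omitted, and any model identifier other than gpt-4o[openai] is purely illustrative):

```python
# One evaluation payload combining two evaluator types, each with its own
# judge model -- same config shape as the examples in this guide.
evaluators = {
    "hallucination": {
        "prompt": "prompt",
        "response": "response",
        "context": "context",
        # a lighter judge for a rapid check (illustrative model name)
        "judge_model": "gpt-4o-mini[openai]",
        "explanation": True,
        "passing_criteria": ["No hallucination (Strict)"]
    },
    "sentiment analysis": {
        "input_text": "response",
        # a more capable judge for nuanced tone analysis
        "judge_model": "gpt-4o[openai]",
        "explanation": True,
        "passing_criteria": "Positive"
    }
}
```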

Let’s explore each evaluator and see how they can transform your AI quality assurance process.

Content Quality Evaluators

Hallucination Detection

Every AI system can occasionally generate information that isn’t supported by available context. Our hallucination detector helps you catch these instances before they impact your users.

When you enable explanation: True, you get detailed insights into how the AI evaluator reached its conclusion. This isn’t just a pass/fail result—it’s a comprehensive analysis that helps you understand and improve your system’s performance.

Input Parameters:

  • prompt: Original input prompt
  • response: Model’s response to evaluate
  • context: Reference context for verification
  • judge_model: Choose from our extensive model library
  • explanation: Get detailed reasoning when set to True
  • passing_criteria: Customize your strictness level
{
    "hallucination": {
        "prompt": "prompt",
        "response": "response",
        "context": "context",
        "judge_model": "gpt-4o[openai]",
        "explanation": True,
        "passing_criteria": ["No hallucination (Strict)"]
    }
}

Content Safety

Modern AI systems need sophisticated safety measures. Our content safety evaluator doesn’t just flag issues—it helps you understand and address them comprehensively.

The criteria system is highly configurable, letting you focus on what matters most for your use case:

  • Toxicity Detection: Identifies harmful or offensive content
  • Bias Analysis: Helps ensure fair and balanced outputs
  • NSFW Content: Maintains professional and appropriate content standards
  • Topic Detection: Ensures content stays within expected domains
  • Keyword Detection: Monitors for specific terms or phrases
  • PII Detection: Here’s where Picept really shines. Beyond just identifying personal information, we can automatically replace it with realistic synthetic data. This means you can continue using the data for training and testing while maintaining privacy—a game-changer for building better AI systems.
{
    "content safety": {
        "response": "response",
        "explanation": True,
        "judge_model": "gpt-4o[openai]",
        "criteria": {
            "toxicity": {"enabled": True},
            "bias": {"enabled": True},
            "nsfw": {"enabled": True},
            "topic_detection": {
                "enabled": True,
                "expected_topics": ["business", "healthcare"] # example topics
            },
            "keyword_detection": {
                "enabled": True,
                "expected_keywords": ["policy", "claim"] # example keywords
            },
            "pii": {
                "enabled": True,
                "options": ["Redact PII and replace with synthetic data"] # redact and replace 
            }
        }
    }
}
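
To make the redact-and-replace idea concrete, here is a minimal local sketch of the concept — not Picept's implementation — covering a single PII type (email addresses) with a fixed synthetic stand-in:

```python
import re

# Conceptual illustration of "redact PII and replace with synthetic data":
# find one class of PII (emails) and swap in a realistic synthetic value,
# so the text stays usable for training and testing.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SYNTHETIC_EMAIL = "jane.doe@example.com"  # synthetic placeholder

def redact_emails(text: str) -> str:
    """Replace every email address with a synthetic one."""
    return EMAIL_RE.sub(SYNTHETIC_EMAIL, text)

print(redact_emails("Contact alice@acme.io about the claim."))
# -> Contact jane.doe@example.com about the claim.
```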

Policy Adherence

In an era of increasing AI regulation, ensuring your AI outputs comply with policies is crucial. Our policy adherence evaluator helps you encode and enforce your guidelines systematically.

The policy parameter accepts detailed guidelines in natural language—no need to translate your policies into complex rules. The system understands and applies them intelligently.

Input Parameters:

  • response: Content to evaluate
  • judge_model: Select the most appropriate evaluator for your policies
  • Policy: Your guidelines in plain English
  • explanation: Get detailed compliance analysis
{
    "policy adherence": {
        "response": "response",
        "judge_model": "gpt-4o[openai]",
        "Policy": """
            1. No sharing of personal information
            2. No offensive language
            3. Content must be family-friendly
            4. No financial or medical advice
            5. No illegal activities
            6. Professional tone
            7. Proper attribution for copyrighted material""",
        "explanation": True
    }
}

Factuality Check

When accuracy matters—and it always does—our factuality checker ensures your AI system’s outputs align with known facts and reference materials. Unlike simple text comparison, this evaluator understands context and nuance, identifying both direct contradictions and subtle inconsistencies.

With explanation: True, you get detailed insights into why something was flagged as factually incorrect, helping you pinpoint and address the root causes of inaccuracies.

Input Parameters:

  • prompt: Original query for context
  • reference: Your source of truth
  • response: Content to evaluate
  • judge_model: Select from our model library
  • passing_criteria: Multiple levels of strictness available
{
    "factuality": {
        "prompt": "prompt",
        "reference": "reference",
        "response": "response",
        "judge_model": "gpt-4o[openai]",
        "explanation": True,
        "passing_criteria": [
            "Response is a consistent subset of the reference",
            "Response matches all details of the reference"
        ]
    }
}

Model Benchmark

Understanding how different models perform is crucial for optimizing your AI system. Our benchmark evaluator goes beyond basic metrics, providing detailed comparisons across multiple dimensions.

The real power comes from the ability to test multiple models simultaneously with the same input, making it easy to select the best model for your specific use case. Add a system prompt to standardize outputs and custom judge instructions for specialized evaluation criteria.

Input Parameters:

  • prompt: Test input
  • system_prompt: Guide model behavior
  • judge_instruction: Custom evaluation criteria
  • models: Models to compare, as a comma-separated string
  • passing_criteria: Performance threshold
{
    "model benchmark": {
        "prompt": "prompt",
        "system_prompt": "you are a helpful assistant",
        "models": "Router1, gpt-4o[openai], claude-3-5-sonnet-20240620[anthropic]",
        "judge_model": "gpt-4o[openai]",
        "explanation": True,
        "passing_criteria": "threshold: 0.7"
    }
}
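
Note that the models field in this example is a single comma-separated string rather than a Python list. If you assemble it programmatically, a join keeps the format consistent with the example above (an assumption about the expected format, based on that example):

```python
# Build the comma-separated models string from a Python list,
# matching the format shown in the benchmark example above.
models = ["Router1", "gpt-4o[openai]", "claude-3-5-sonnet-20240620[anthropic]"]
models_field = ", ".join(models)
print(models_field)
# -> Router1, gpt-4o[openai], claude-3-5-sonnet-20240620[anthropic]
```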

Sentiment Analysis

Understanding the emotional tone of AI outputs is crucial for maintaining appropriate user interactions. Our sentiment analyzer doesn’t just classify text as positive or negative—it provides nuanced insights into emotional undertones and potential impact.

Input Parameters:

  • input_text: Content to analyze
  • judge_model: Select for your specific needs
  • passing_criteria: Target sentiment
  • explanation: Get detailed emotional analysis
{
    "sentiment analysis": {
        "input_text": "response",
        "explanation": True,
        "judge_model": "gpt-4o[openai]",
        "passing_criteria": "Positive"
    }
}

Technical Validators

JSON/XML Validation

Data structure validation is critical for maintaining system integrity. Our validators don’t just check syntax—they ensure your structured data meets specific schema requirements and business rules.

What sets our validators apart is their ability to provide helpful context when validation fails, making debugging and fixes much faster. You can even specify reference schemas to ensure strict compliance with your data standards.

Input Parameters for JSON:

{
    "Valid JSON or XML": {
        "input_text": "input",
        "is_valid": "JSON",
        # optional - define a reference schema
        "ref_text": """{ 
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "number"}
            },
            "required": ["name", "age"]
        }""",
        "explanation": True
    }
}
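
For intuition, the two halves of this check — syntax and schema — can be sketched locally with the standard library. This is only an illustration, not Picept's implementation: json.loads covers syntax, and the sketch checks just the "required" part of the reference schema (a full JSON Schema validator would also enforce types):

```python
import json

# Minimal local sketch of a JSON check: syntax via json.loads, plus the
# "required" keys from a reference schema like the one above.
def check_json(text, required):
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    missing = [key for key in required if key not in data]
    if missing:
        return False, f"missing required keys: {missing}"
    return True, "ok"

print(check_json('{"name": "Ada", "age": 36}', ["name", "age"]))
# -> (True, 'ok')
print(check_json('{"name": "Ada"}', ["name", "age"])[1])
# -> missing required keys: ['age']
```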

Input Parameters for XML:

{
    "Valid JSON or XML": {
        "input_text": "input_xml",
        "is_valid": "XML",
        # optional - define a reference schema
        "ref_text": """
            <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
              <xs:element name="person">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="name" type="xs:string"/>
                    <xs:element name="age" type="xs:integer"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:schema>
            """,
        "explanation": True
    }
}
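
As with JSON, the syntax half of the XML check can be sketched locally for intuition. Well-formedness is handled by the standard library; validating against an XSD reference schema like the one above requires a schema-aware library (e.g. lxml), so this sketch covers only the first half:

```python
import xml.etree.ElementTree as ET

# Local sketch of the XML well-formedness check only -- XSD schema
# validation is out of scope for the standard library parser.
def is_well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<person><name>Ada</name><age>36</age></person>"))  # True
print(is_well_formed("<person><name>Ada</person>"))  # False (mismatched tag)
```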

Custom Evaluations

Sometimes you need evaluations tailored to your specific use case. Our custom prompt evaluator lets you define exactly what you want to assess, with the full power of our evaluation infrastructure behind it.

Input Parameters:

  • input_text: Content to evaluate
  • judge_instruction: Your custom evaluation criteria
  • judge_model: Select the most appropriate model
  • passing_criteria: Define success conditions
{
    "custom prompt": {
        "input_text": "prompt",
        "judge_instruction": "If the input is a simple math problem, return yes; otherwise, return no.",
        "judge_model": "gpt-4o[openai]",
        "explanation": True,
        "passing_criteria": [
            {"label": "yes", "Grade": "Passed"},
            {"label": "no", "Grade": "Failed"}
        ]
    }
}
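
The passing_criteria list above pairs each judge label with a grade. Conceptually, resolving a judge's verdict is a simple lookup over those pairs — the sketch below is a hypothetical illustration of that mapping, not Picept's actual implementation:

```python
# Hypothetical sketch: resolve a judge's label to a grade using a
# passing_criteria list shaped like the example above.
PASSING_CRITERIA = [
    {"label": "yes", "Grade": "Passed"},
    {"label": "no", "Grade": "Failed"},
]

def grade_for(label):
    for criterion in PASSING_CRITERIA:
        if criterion["label"] == label:
            return criterion["Grade"]
    return "Unknown"

print(grade_for("yes"))  # Passed
print(grade_for("no"))   # Failed
```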

Next Steps

Now that you understand the power and flexibility of Picept’s evaluators, you’re ready to:

  • Set up your first evaluation pipeline
  • Explore our playground environment
  • Create custom evaluations for your specific needs
  • Implement continuous monitoring of your AI systems

Check out our Batch Processing guide to learn how to scale your evaluations efficiently.