In today’s AI landscape, ensuring the quality, safety, and reliability of AI outputs isn’t just a technical requirement—it’s a business imperative. Picept’s evaluators go beyond simple checks, offering a sophisticated system that helps you build trust in your AI applications while maintaining compliance and quality standards.

One of the most powerful features across all our evaluators is the flexible judge model selection. You can choose from over 100 different models, ranging from lightweight options for rapid testing to sophisticated models for nuanced evaluation. Even better, you can use different judge models for different evaluator types in the same API call, optimizing for both performance and accuracy.
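
For instance, a single payload can mix a fast judge for a structural check with a stronger judge for nuanced analysis. Here's a minimal sketch that mirrors the per-evaluator config shape used throughout this guide (the surrounding API call is omitted, and any model identifier other than gpt-4o[openai] is purely illustrative):

```python
# One evaluation payload combining two evaluator types, each with its own
# judge model -- same config shape as the examples in this guide.
evaluators = {
    "hallucination": {
        "prompt": "prompt",
        "response": "response",
        "context": "context",
        # a lighter judge for a rapid check (illustrative model name)
        "judge_model": "gpt-4o-mini[openai]",
        "explanation": True,
        "passing_criteria": ["No hallucination (Strict)"]
    },
    "sentiment analysis": {
        "input_text": "response",
        # a more capable judge for nuanced tone analysis
        "judge_model": "gpt-4o[openai]",
        "explanation": True,
        "passing_criteria": "Positive"
    }
}
```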

Let’s explore each evaluator and see how they can transform your AI quality assurance process.

Content Quality Evaluators

Hallucination Detection

Every AI system can occasionally generate information that isn’t supported by available context. Our hallucination detector helps you catch these instances before they impact your users.

When you enable explanation: True, you get detailed insights into how the AI evaluator reached its conclusion. This isn’t just a pass/fail result—it’s a comprehensive analysis that helps you understand and improve your system’s performance.

Input Parameters:

  • prompt: Original input prompt
  • response: Model’s response to evaluate
  • context: Reference context for verification
  • judge_model: Choose from our extensive model library
  • explanation: Get detailed reasoning when set to True
  • passing_criteria: Customize your strictness level
{
    "hallucination": {
        "prompt": "prompt",
        "response": "response",
        "context": "context",
        "judge_model": "gpt-4o[openai]",
        "explanation": True,
        "passing_criteria": ["No hallucination (Strict)"]
    }
}

Content Safety

Modern AI systems need sophisticated safety measures. Our content safety evaluator doesn’t just flag issues—it helps you understand and address them comprehensively.

The criteria system is highly configurable, letting you focus on what matters most for your use case:

  • Toxicity Detection: Identifies harmful or offensive content
  • Bias Analysis: Helps ensure fair and balanced outputs
  • NSFW Content: Maintains professional and appropriate content standards
  • Topic Detection: Ensures content stays within expected domains
  • Keyword Detection: Monitors for specific terms or phrases
  • PII Detection: Here’s where Picept really shines. Beyond just identifying personal information, we can automatically replace it with realistic synthetic data. This means you can continue using the data for training and testing while maintaining privacy—a game-changer for building better AI systems.
{
    "content safety": {
        "response": "response",
        "explanation": True,
        "judge_model": "gpt-4o[openai]",
        "criteria": {
            "toxicity": {"enabled": True},
            "bias": {"enabled": True},
            "nsfw": {"enabled": True},
            "topic_detection": {
                "enabled": True,
                "expected_topics": ["business", "healthcare"] # example topics
            },
            "keyword_detection": {
                "enabled": True,
                "expected_keywords": ["policy", "claim"] # example keywords
            },
            "pii": {
                "enabled": True,
                "options": ["Redact PII and replace with synthetic data"] # redact and replace 
            }
        }
    }
}
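
To make the redact-and-replace idea concrete, here is a minimal local sketch of the concept — not Picept's implementation — covering a single PII type (email addresses) with a fixed synthetic stand-in:

```python
import re

# Conceptual illustration of "redact PII and replace with synthetic data":
# find one class of PII (emails) and swap in a realistic synthetic value,
# so the text stays usable for training and testing.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SYNTHETIC_EMAIL = "jane.doe@example.com"  # synthetic placeholder

def redact_emails(text: str) -> str:
    """Replace every email address with a synthetic one."""
    return EMAIL_RE.sub(SYNTHETIC_EMAIL, text)

print(redact_emails("Contact alice@acme.io about the claim."))
# -> Contact jane.doe@example.com about the claim.
```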

Policy Adherence

In an era of increasing AI regulation, ensuring your AI outputs comply with policies is crucial. Our policy adherence evaluator helps you encode and enforce your guidelines systematically.

The policy parameter accepts detailed guidelines in natural language—no need to translate your policies into complex rules. The system understands and applies them intelligently.

Input Parameters:

  • response: Content to evaluate
  • judge_model: Select the most appropriate evaluator for your policies
  • Policy: Your guidelines in plain English
  • explanation: Get detailed compliance analysis
{
    "policy adherence": {
        "response": "response",
        "judge_model": "gpt-4o[openai]",
        "Policy": """
            1. No sharing of personal information
            2. No offensive language
            3. Content must be family-friendly
            4. No financial or medical advice
            5. No illegal activities
            6. Professional tone
            7. Proper attribution for copyrighted material""",
        "explanation": True
    }
}

Factuality Check

When accuracy matters—and it always does—our factuality checker ensures your AI system’s outputs align with known facts and reference materials. Unlike simple text comparison, this evaluator understands context and nuance, identifying both direct contradictions and subtle inconsistencies.

With explanation: True, you get detailed insights into why something was flagged as factually incorrect, helping you pinpoint and address the root causes of inaccuracies.

Input Parameters:

  • prompt: Original query for context
  • reference: Your source of truth
  • response: Content to evaluate
  • judge_model: Select from our model library
  • passing_criteria: Multiple levels of strictness available
{
    "factuality": {
        "prompt": "prompt",
        "reference": "reference",
        "response": "response",
        "judge_model": "gpt-4o[openai]",
        "explanation": True,
        "passing_criteria": [
            "Response is a consistent subset of the reference",
            "Response matches all details of the reference"
        ]
    }
}

Model Benchmark

Understanding how different models perform is crucial for optimizing your AI system. Our benchmark evaluator goes beyond basic metrics, providing detailed comparisons across multiple dimensions.

The real power comes from the ability to test multiple models simultaneously with the same input, making it easy to select the best model for your specific use case. Add a system prompt to standardize outputs and custom judge instructions for specialized evaluation criteria.

Input Parameters:

  • prompt: Test input
  • system_prompt: Guide model behavior
  • judge_instruction: Custom evaluation criteria
  • models: Models to compare, as a comma-separated string
  • passing_criteria: Performance threshold
{
    "model benchmark": {
        "prompt": "prompt",
        "system_prompt": "you are a helpful assistant",
        "models": "Router1, gpt-4o[openai], claude-3-5-sonnet-20240620[anthropic]",
        "judge_model": "gpt-4o[openai]",
        "explanation": True,
        "passing_criteria": "threshold: 0.7"
    }
}
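
Note that the models field in this example is a single comma-separated string rather than a Python list. If you assemble it programmatically, a join keeps the format consistent with the example above (an assumption about the expected format, based on that example):

```python
# Build the comma-separated models string from a Python list,
# matching the format shown in the benchmark example above.
models = ["Router1", "gpt-4o[openai]", "claude-3-5-sonnet-20240620[anthropic]"]
models_field = ", ".join(models)
print(models_field)
# -> Router1, gpt-4o[openai], claude-3-5-sonnet-20240620[anthropic]
```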

Sentiment Analysis

Understanding the emotional tone of AI outputs is crucial for maintaining appropriate user interactions. Our sentiment analyzer doesn’t just classify text as positive or negative—it provides nuanced insights into emotional undertones and potential impact.

Input Parameters:

  • input_text: Content to analyze
  • judge_model: Select for your specific needs
  • passing_criteria: Target sentiment
  • explanation: Get detailed emotional analysis
{
    "sentiment analysis": {
        "input_text": "response",
        "explanation": True,
        "judge_model": "gpt-4o[openai]",
        "passing_criteria": "Positive"
    }
}

Technical Validators

JSON/XML Validation

Data structure validation is critical for maintaining system integrity. Our validators don’t just check syntax—they ensure your structured data meets specific schema requirements and business rules.

What sets our validators apart is their ability to provide helpful context when validation fails, making debugging and fixes much faster. You can even specify reference schemas to ensure strict compliance with your data standards.

Input Parameters for JSON:

{
    "Valid JSON or XML": {
        "input_text": "input",
        "is_valid": "JSON",
        # optional - define a reference schema
        "ref_text": """{ 
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "number"}
            },
            "required": ["name", "age"]
        }""",
        "explanation": True
    }
}
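
For intuition, the two halves of this check — syntax and schema — can be sketched locally with the standard library. This is only an illustration, not Picept's implementation: json.loads covers syntax, and the sketch checks just the "required" part of the reference schema (a full JSON Schema validator would also enforce types):

```python
import json

# Minimal local sketch of a JSON check: syntax via json.loads, plus the
# "required" keys from a reference schema like the one above.
def check_json(text, required):
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    missing = [key for key in required if key not in data]
    if missing:
        return False, f"missing required keys: {missing}"
    return True, "ok"

print(check_json('{"name": "Ada", "age": 36}', ["name", "age"]))
# -> (True, 'ok')
print(check_json('{"name": "Ada"}', ["name", "age"])[1])
# -> missing required keys: ['age']
```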

Input Parameters for XML:

{
    "Valid JSON or XML": {
        "input_text": "input_xml",
        "is_valid": "XML",
        # optional - define a reference schema
        "ref_text": """
            <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
              <xs:element name="person">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="name" type="xs:string"/>
                    <xs:element name="age" type="xs:integer"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:schema>
            """,
        "explanation": True
    }
}
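
As with JSON, the syntax half of the XML check can be sketched locally for intuition. Well-formedness is handled by the standard library; validating against an XSD reference schema like the one above requires a schema-aware library (e.g. lxml), so this sketch covers only the first half:

```python
import xml.etree.ElementTree as ET

# Local sketch of the XML well-formedness check only -- XSD schema
# validation is out of scope for the standard library parser.
def is_well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<person><name>Ada</name><age>36</age></person>"))  # True
print(is_well_formed("<person><name>Ada</person>"))  # False (mismatched tag)
```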

Custom Evaluations

Sometimes you need evaluations tailored to your specific use case. Our custom prompt evaluator lets you define exactly what you want to assess, with the full power of our evaluation infrastructure behind it.

Input Parameters:

  • input_text: Content to evaluate
  • judge_instruction: Your custom evaluation criteria
  • judge_model: Select the most appropriate model
  • passing_criteria: Define success conditions
{
    "custom prompt": {
        "input_text": "prompt",
        "judge_instruction": "If the input is a simple math problem, return yes; otherwise, return no.",
        "judge_model": "gpt-4o[openai]",
        "explanation": True,
        "passing_criteria": [
            {"label": "yes", "Grade": "Passed"},
            {"label": "no", "Grade": "Failed"}
        ]
    }
}
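
The passing_criteria list above pairs each judge label with a grade. Conceptually, resolving a judge's verdict is a simple lookup over those pairs — the sketch below is a hypothetical illustration of that mapping, not Picept's actual implementation:

```python
# Hypothetical sketch: resolve a judge's label to a grade using a
# passing_criteria list shaped like the example above.
PASSING_CRITERIA = [
    {"label": "yes", "Grade": "Passed"},
    {"label": "no", "Grade": "Failed"},
]

def grade_for(label):
    for criterion in PASSING_CRITERIA:
        if criterion["label"] == label:
            return criterion["Grade"]
    return "Unknown"

print(grade_for("yes"))  # Passed
print(grade_for("no"))   # Failed
```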

Next Steps

Now that you understand the power and flexibility of Picept’s evaluators, you’re ready to:

  • Set up your first evaluation pipeline
  • Explore our playground environment
  • Create custom evaluations for your specific needs
  • Implement continuous monitoring of your AI systems

Check out our Batch Processing guide to learn how to scale your evaluations efficiently.