When you need to evaluate large volumes of AI interactions efficiently, Picept’s batch processing capabilities have you covered. Instead of making individual API calls, you can evaluate hundreds or thousands of LLM interactions in one go, analyzing everything from model outputs to conversation flows while maintaining comprehensive quality checks.

Running Batch Evaluations

Simply pass lists of inputs in your dataset, and Picept automatically handles the batch processing. Here’s a complete example:

import os
import requests

# Assumes your Picept API key is available in the environment.
PICEPT_API_KEY = os.environ["PICEPT_API_KEY"]

payload = {
    "evaluation_name": "evaluation_name",  # Optional name for your evaluation job

    "dataset": {
        "prompt": [
            "What is the scientific name of a domestic cat?",
            "What is the tallest building in New York City?"
        ],
        "response": [
            "The scientific name of a domestic cat is Felis sylvestris...",
            "The tallest building in New York City is the Freedom Tower..."
        ],
        "reference": [
            "Felis catus",
            "One World Trade Center, 1,776 feet"
        ],
        "context": [
            "The domestic cat, scientifically named Felis catus...",
            "One World Trade Center (also known as Freedom Tower...)"
        ]
    },
    "evaluators": {
        "hallucination": {
            "prompt": "prompt",
            "response": "response",
            "context": "context",
            "judge_model": "gpt-4o[openai]",
            "explanation": True,
            "passing_criteria": ["No hallucination (Strict)"]
        },
        "factuality": {
            "prompt": "prompt",
            "reference": "reference",
            "response": "response",
            "judge_model": "gpt-4o-mini[openai]",
            "explanation": True,
            "passing_criteria": ["Response is a consistent subset of the reference"]
        }
    }
}

response = requests.post(
    "https://api.picept.ai/v1/evaluation",
    json=payload,
    headers={
        "Authorization": f"Bearer {PICEPT_API_KEY}",
        "Content-Type": "application/json"
    }
)

Monitoring Batch Progress

Track the progress of your batch evaluations in real-time:

import time

job_id = response.json()["job_id"]
evaluators = response.json()["evaluators"]

while True:
    try:
        status_response = requests.get(f"https://api.picept.ai/v1/evaluation/{job_id}/status")
        status_response.raise_for_status()

        status_data = status_response.json()

        print("\nCurrent Evaluator Statuses:")
        for eval_type, status in status_data["evaluator_statuses"].items():
            progress = status_data["progress"].get(eval_type, {})
            completed = progress.get("completed_batches", 0)
            total = progress.get("total_batches", 0)
            print(f"  {eval_type}: {status} (Progress: {completed}/{total})")

        if status_data["overall_status"] in ["done", "failed"]:
            break

    except requests.RequestException as e:
        print(f"Error fetching status: {e}")

    # Wait before polling again. The sleep also runs after a failed request,
    # so transient errors don't turn into a tight retry loop.
    time.sleep(5)

Flexible Input Methods

We understand that different teams have different needs when it comes to handling their evaluation data. That’s why we’ve built multiple ways to feed your LLM interactions into Picept’s evaluation system.

Direct API Upload

Want to programmatically send your data? Our API makes it seamless. Just pass your arrays directly in the payload, and we’ll handle the rest. It’s perfect when you’re working with smaller datasets or need tight integration with your existing systems. The best part? Our system automatically optimizes the batch size for you while keeping you updated on progress in real time.

CSV/JSON File Upload

Got a massive dataset sitting in spreadsheets or JSON files? No problem. Our UI makes it incredibly simple to upload these files directly. Just drag and drop, and we’ll take care of mapping the columns correctly. It’s especially useful when you’re dealing with historical data or need to run one-off evaluations on large datasets. And don’t worry about the format - if it’s a standard CSV or JSON file, we’ve got you covered.
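
If you’d rather build the API payload from a local CSV yourself instead of uploading through the UI, here is a minimal sketch. The file name and column names (interactions.csv, prompt, response, reference, context) are illustrative assumptions; the payload shape matches the batch example above.

import pandas as pd

# Illustrative: a local CSV whose columns mirror the dataset fields
# used in the batch example above (prompt, response, reference, context).
df = pd.read_csv("interactions.csv")

payload = {
    "evaluation_name": "csv_batch_eval",  # illustrative name
    "dataset": {
        "prompt": df["prompt"].tolist(),
        "response": df["response"].tolist(),
        "reference": df["reference"].tolist(),
        "context": df["context"].tolist(),
    },
    "evaluators": {
        "factuality": {
            "prompt": "prompt",
            "reference": "reference",
            "response": "response",
            "judge_model": "gpt-4o-mini[openai]",
            "explanation": True,
            "passing_criteria": ["Response is a consistent subset of the reference"],
        }
    },
}

You can then POST this payload exactly as in the batch example above.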

Production Data Integration

Here’s where things get really interesting. Connect Picept directly to your production systems, and you can evaluate your LLM interactions as they happen. With Picept, you can run evaluations on your chat completions directly from your production environment – no need to extract or transform your data. Stream your evaluation data in real-time, set up continuous monitoring, and get instant alerts if something goes wrong. Better yet, schedule monitoring jobs to automatically evaluate your production data at regular intervals, helping you maintain and improve your AI application’s reliability over time. It’s like having a quality assurance team working 24/7, constantly learning from real user interactions to ensure your AI systems maintain high standards.
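
As a rough illustration of that workflow, the sketch below evaluates a single live chat completion by reusing the payload shape from the batch example. The OpenAI call, the variable names, and the choice of the hallucination evaluator are illustrative assumptions, not a prescribed integration.

from openai import OpenAI
import requests

client = OpenAI()  # illustrative: any chat-completion client works

user_prompt = "What is the tallest building in New York City?"
retrieved_context = "One World Trade Center (also known as Freedom Tower...)"  # e.g. your retrieval context

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": user_prompt}],
)
answer = completion.choices[0].message.content

# Reuse the same payload shape as the batch example, with a single-row dataset.
payload = {
    "evaluation_name": "production_spot_check",  # illustrative name
    "dataset": {
        "prompt": [user_prompt],
        "response": [answer],
        "context": [retrieved_context],
    },
    "evaluators": {
        "hallucination": {
            "prompt": "prompt",
            "response": "response",
            "context": "context",
            "judge_model": "gpt-4o[openai]",
            "explanation": True,
            "passing_criteria": ["No hallucination (Strict)"],
        }
    },
}

requests.post(
    "https://api.picept.ai/v1/evaluation",
    json=payload,
    headers={"Authorization": f"Bearer {PICEPT_API_KEY}", "Content-Type": "application/json"},
)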

Think of it as having different gears in your evaluation engine - choose the one that best fits your speed and style. Whether you’re doing a quick test run with the API, analyzing months of historical data through file uploads, or leveraging your production data for continuous quality monitoring, we’ve built Picept to adapt to your workflow, not the other way around. You’re not just evaluating – you’re building a more reliable AI system that learns and improves from real-world usage patterns.

Best Practices

Optimization Tips

  1. Batch Size Planning

    • Balance speed and reliability
    • Consider model token limits
    • Group similar evaluations together
  2. Resource Management

    • Schedule large batches during off-peak hours
    • Monitor evaluation costs
    • Use appropriate judge models for scale
  3. Error Handling

    • Implement robust retry logic (see the sketch after this list)
    • Monitor partial failures
    • Save progress checkpoints
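
To make the retry and checkpoint points concrete, here is a minimal sketch. It wraps the submission call shown earlier in simple retries and saves the returned job_id so a crashed script can resume monitoring; the helper name, backoff policy, and checkpoint file are illustrative assumptions.

import json
import time
import requests

def submit_with_retries(payload, max_attempts=3, checkpoint_file="eval_checkpoint.json"):
    """Submit an evaluation job, retrying transient failures and
    checkpointing the job_id so monitoring can resume after a crash."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(
                "https://api.picept.ai/v1/evaluation",
                json=payload,
                headers={"Authorization": f"Bearer {PICEPT_API_KEY}"},
                timeout=60,
            )
            response.raise_for_status()
            job_id = response.json()["job_id"]
            # Save a progress checkpoint: the job_id is all you need to poll status later.
            with open(checkpoint_file, "w") as f:
                json.dump({"job_id": job_id}, f)
            return job_id
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed: {exc}")
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff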

Data Preparation

  1. Input Formatting

    • Ensure consistent data structure
    • Validate inputs before submission (see the sketch after this list)
    • Handle missing values appropriately
  2. Reference Management

    • Organize reference materials efficiently
    • Version control your ground truth data
    • Consider caching for repeated evaluations
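
To make the input-formatting points concrete, here is a minimal validation sketch you might run before submission, assuming the payload dict from the batch example above. It only checks the structural expectations implied by that example (equal-length lists, no missing values); the function name is hypothetical.

def validate_dataset(dataset):
    """Check that every dataset column is a list of equal length
    with no missing values, before submitting an evaluation job."""
    lengths = {name: len(values) for name, values in dataset.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Columns have mismatched lengths: {lengths}")
    for name, values in dataset.items():
        for i, value in enumerate(values):
            if value is None or (isinstance(value, str) and not value.strip()):
                raise ValueError(f"Missing value in column '{name}' at row {i}")

validate_dataset(payload["dataset"])  # raises if the dataset is malformed, before submission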

Analytics Dashboard

All batch evaluations are automatically logged in your Picept dashboard, providing a comprehensive view of your evaluation results and trends. The interactive dashboard turns your evaluation data into actionable insights.

The analytics dashboard offers:

  • Real-time progress tracking of ongoing evaluations
  • Detailed success/failure rates across different evaluator types
  • Interactive visualizations of evaluation trends
  • Comprehensive reports exportable in multiple formats
  • Team collaboration features for shared analysis
  • Custom alert configurations based on your metrics
  • Historical performance tracking and benchmarking

Each evaluation job gets its own detailed report page where you can:

  • Dive deep into individual evaluation results
  • Analyze performance patterns
  • Export specific data segments
  • Share insights with team members
  • Configure automated report distribution

Scheduled Evaluations

Take batch processing to the next level with scheduled evaluations:

  • Set up periodic evaluation jobs
  • Monitor trends over time
  • Receive anomaly alerts
  • Maintain continuous quality checks

Next Steps