Eval Frameworks

Build systematic evaluations, not guesswork

Define custom evaluation metrics tailored to your LLM application.

Run evals on production data, catch regressions early, and iterate confidently with structured testing instead of ship-and-see.

Try LangSmith free. No credit card required.

[Image: LangSmith evaluation framework interface with test metrics]

How LangSmith evaluation frameworks work

1. Define your evaluators

Build custom evals with Python, LLM-as-judge, or human feedback. Target the metrics your product actually cares about—accuracy, latency, cost, safety, or domain-specific requirements.
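As a minimal sketch of the code path, assuming the langsmith Python SDK: an evaluator is just a function that takes a run and its reference example and returns a named score (the "answer" keys below are assumptions about your app's schema):

```python
from langsmith.schemas import Example, Run

def exact_match(run: Run, example: Example) -> dict:
    # Compare the model's output to the dataset's reference answer.
    # The "answer" key is an assumption about your application's output schema.
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": int(predicted.strip() == expected.strip())}
```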

2. Run evals on your data

Test offline on datasets or online against production traffic. Automatically surface regressions and compare performance across prompt versions and model changes.
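A hedged sketch of an offline run using the SDK's evaluate entry point; the dataset name, target stub, and evaluator below are placeholders:

```python
from langsmith.evaluation import evaluate

def correctness(run, example) -> dict:
    # Exact-match evaluator; the "answer" keys are schema assumptions.
    out = (run.outputs or {}).get("answer", "")
    ref = (example.outputs or {}).get("answer", "")
    return {"key": "correctness", "score": int(out == ref)}

def target(inputs: dict) -> dict:
    # Stand-in for your application; replace with your real chain or model call.
    return {"answer": inputs["question"].strip()}

results = evaluate(
    target,
    data="prod-regression-set",     # hypothetical dataset name
    evaluators=[correctness],
    experiment_prefix="prompt-v2",  # groups runs so versions can be compared
)
```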

3. Ship with confidence

Gate deployments on evaluation thresholds. Close the feedback loop from production data to training datasets, turning real usage into continuous improvement.
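Continuing the evaluate(...) sketch above, one way to gate a deploy: aggregate the experiment's scores and fail CI below a bar. This assumes evaluate's return value can be iterated for per-run feedback; the 90% threshold is illustrative.

```python
import sys

# `results` comes from the evaluate(...) call above; iterating yields one
# entry per example, each carrying the evaluator feedback for that run.
scores = [
    res.score
    for row in results
    for res in row["evaluation_results"]["results"]
    if res.key == "correctness"
]
mean = sum(scores) / max(len(scores), 1)
print(f"correctness: {mean:.1%} across {len(scores)} examples")
if mean < 0.90:  # illustrative threshold: block the deploy below 90%
    sys.exit(1)
```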

LangSmith powers top engineering teams, from AI startups to global enterprises

Zip
Writer
Harvey
Vanta
Abridge
Clay
Rippling
Mercor
Listen Labs
dbt Labs
Klarna
Headspace
Lyft
Coinbase
Rakuten
LinkedIn
Elastic
Workday
Monday.com

Evaluation at Scale

Leading AI teams trust LangSmith for evaluation and quality assurance

50M+ LLM calls traced
1B+ events ingested per day
100K+ monthly active orgs in LangSmith SaaS

LangSmith Evaluation Framework Platform

Define, execute, and scale evaluations for continuous quality improvement

LangSmith captures complete execution traces from your LLM applications. These traces become the source of truth for your evaluation datasets, so you're always testing against real production patterns.
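A sketch of that loop, assuming the SDK's Client and @traceable decorator; the project and dataset names are hypothetical:

```python
from langsmith import Client, traceable

@traceable  # records inputs, outputs, and timing of each call as a trace
def answer(question: str) -> str:
    return "stub"  # replace with your real application logic

# Later: promote real production runs into an evaluation dataset.
client = Client()  # reads LANGSMITH_API_KEY from the environment
dataset = client.create_dataset("prod-regression-set")
for run in client.list_runs(project_name="my-prod-app", is_root=True, limit=50):
    client.create_example(inputs=run.inputs, outputs=run.outputs, dataset_id=dataset.id)
```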

Connect with our team to see how
[Image: LangSmith trace interface showing LLM call details]

Built for Enterprise

Security and compliance at scale

LangSmith meets the demanding security, performance, and collaboration requirements of large organizations building AI applications at scale.


Granular permissions

Role-based access control with org-level permissions and project isolation to meet your security and compliance requirements.


SOC 2 Type II

Independent third-party audit and certification of our security controls.

Trust center

Self-hosted deployment

Self-hosting options to maintain full control over your AI data and meet strict compliance requirements.

Why top AI teams choose LangSmith for evaluation

Define what matters to you

Create custom evaluators with code, LLM-as-judge, or human feedback. Measure accuracy, latency, cost, safety—whatever your product requires.
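For the LLM-as-judge flavor, a hedged sketch in which a second model grades each output; the grader model, prompt, and key names are all illustrative:

```python
import json
from openai import OpenAI
from langsmith.schemas import Example, Run

judge = OpenAI()

def helpfulness(run: Run, example: Example) -> dict:
    # Ask a grader model for a 0.0-1.0 score; key names are assumptions.
    prompt = (
        "Rate how well the answer addresses the question on a 0.0-1.0 scale.\n"
        f"Question: {example.inputs.get('question')}\n"
        f"Answer: {(run.outputs or {}).get('answer')}\n"
        'Respond with JSON, e.g. {"score": 0.8}.'
    )
    resp = judge.chat.completions.create(
        model="gpt-4o-mini",  # illustrative grader model
        messages=[{"role": "user", "content": prompt}],
    )
    return {"key": "helpfulness", "score": json.loads(resp.choices[0].message.content)["score"]}
```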

Catch regressions before production

Run evals offline on test datasets or online against live traffic. Surface issues automatically so you ship with confidence.
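One regression-testing pattern, sketched with placeholders: run the same dataset and evaluators once per prompt version, so the resulting experiments line up for side-by-side comparison.

```python
from langsmith.evaluation import evaluate

PROMPTS = {"prompt-v1": "Answer briefly: ", "prompt-v2": "Answer step by step: "}

def call_model(prompt: str) -> str:
    return "stub"  # replace with a real model call

def correctness(run, example) -> dict:
    out = (run.outputs or {}).get("answer", "")
    ref = (example.outputs or {}).get("answer", "")
    return {"key": "correctness", "score": int(out == ref)}

for name, prompt in PROMPTS.items():
    evaluate(
        lambda inputs, p=prompt: {"answer": call_model(p + inputs["question"])},
        data="prod-regression-set",  # hypothetical dataset name
        evaluators=[correctness],
        experiment_prefix=name,      # experiments pair up for comparison
    )
```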

Works with any framework

LangSmith evaluation works with OpenAI, Anthropic, open-source models, or custom implementations. Bring any LLM stack.
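A sketch of provider-agnostic tracing, assuming the SDK's OpenAI wrapper; any other client or custom function can be instrumented the same way with @traceable:

```python
from openai import OpenAI
from langsmith.wrappers import wrap_openai

client = wrap_openai(OpenAI())  # calls through this client are traced to LangSmith

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any model behind an OpenAI-compatible endpoint works
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```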

Customers

Elastic

"Working with LangSmith on the Elastic AI Assistant had a significant positive impact on the overall pace and quality of our development and shipping experience. We couldn't have delivered the product experience our customers now have without LangSmith—and we couldn't have done it at the same pace without it."

James Spiteri, Director of Security Product Management at Elastic

Read case study
Rakuten

"What we really needed was a more structured way to test new approaches, something better than just shipping and seeing what happened. LangSmith gave us a scientific, structured way to understand what was actually working. We could run pairwise evaluations and understand why accuracy jumped from 70% to 80%. Our engineers love the intuitive debugging experience."

Yusuke Kaji, General Manager of AI for Business Development at Rakuten

Read case study

Get a Demo of LangSmith Evals

Learn how to build evaluation frameworks that catch issues early and keep your LLM applications performing at their best.