Observability & Evaluation

12 open-source agents tracked · Ranked by trust score · Updated 2026-05-30 20:03 UTC

HVTracker independently evaluates 12 open-source observability & evaluation agents using daily signals from GitHub, package registries, and security databases. Each agent is scored on activity, adoption, transparency, safety, and identity. The top-ranked agent in this category is Weights & Biases Weave, with a trust score of 84.9/100 (Grade B). Other leading projects include Langfuse and Promptfoo.

Agents: 12 · Avg Trust: 61 · Total Stars: 118.4k · Grade A: 0

| # | Agent | Grade | Description | Trust | Stars | Language |
|---|-------|-------|-------------|-------|-------|----------|
| 1 | Weights & Biases Weave | B | Weave is a toolkit for developing AI-powered applications, built by Weights & Biases. | 84.9 | 1.1k | Python |
| 2 | Langfuse | B | 🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Inte | 74.7 | 28.2k | TypeScript |
| 3 | Promptfoo | B | Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C | 73.8 | 21.7k | TypeScript |
| 4 | Arize Phoenix | B | AI Observability & Evaluation | 73.2 | 9.9k | Python |
| 5 | Giskard | B | 🐢 Open-Source Evaluation & Testing library for LLM Agents | 70.0 | 5.4k | Python |
| 6 | LangWatch | B | The platform for LLM evaluations and AI agent testing | 68.0 | 3.3k | TypeScript |
| 7 | DeepEval | B | The LLM Evaluation Framework | 61.9 | 15.8k | Python |
| 8 | Helicone | C | 🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓 | 60.3 | 5.8k | TypeScript |
| 9 | Ragas | B | Supercharge Your LLM Application Evaluations 🚀 | 53.4 | 14.1k | Python |
| 10 | TruLens | C | Evaluation and Tracking for LLM Experiments and AI Agents | 39.5 | 3.4k | Python |
| 11 | Agenta | C | The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one | 38.6 | 4.2k | TypeScript |
| 12 | AgentOps | C | Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frame | 32.2 | 5.6k | Python |