12 open-source agents tracked · Ranked by trust score · Updated 2026-05-30 20:03 UTC
HVTracker independently evaluates 12 open-source observability & evaluation agents using daily signals from GitHub, package registries, and security databases. Each agent is scored on activity, adoption, transparency, safety, and identity. The top-ranked observability & evaluation agent is Weights & Biases Weave, with a trust score of 84.9/100 (Grade B). Other leading projects include Langfuse and Promptfoo.
| # | Agent | Grade | Description | Trust | Stars | Language |
|---|---|---|---|---|---|---|
| 1 | Weights & Biases Weave | B | Weave is a toolkit for developing AI-powered applications, built by Weights & Biases. | 84.9 | 1.1k | Python |
| 2 | Langfuse | B | 🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Inte | 74.7 | 28.2k | TypeScript |
| 3 | Promptfoo | B | Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C | 73.8 | 21.7k | TypeScript |
| 4 | Arize Phoenix | B | AI Observability & Evaluation | 73.2 | 9.9k | Python |
| 5 | Giskard | B | 🐢 Open-Source Evaluation & Testing library for LLM Agents | 70.0 | 5.4k | Python |
| 6 | LangWatch | B | The platform for LLM evaluations and AI agent testing | 68.0 | 3.3k | TypeScript |
| 7 | DeepEval | B | The LLM Evaluation Framework | 61.9 | 15.8k | Python |
| 8 | Helicone | C | 🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓 | 60.3 | 5.8k | TypeScript |
| 9 | Ragas | B | Supercharge Your LLM Application Evaluations 🚀 | 53.4 | 14.1k | Python |
| 10 | TruLens | C | Evaluation and Tracking for LLM Experiments and AI Agents | 39.5 | 3.4k | Python |
| 11 | Agenta | C | The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one | 38.6 | 4.2k | TypeScript |
| 12 | AgentOps | C | Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frame | 32.2 | 5.6k | Python |