GenAI is evolving at an unprecedented pace, with frequent releases of new large language models (LLMs) that bring performance improvements, efficiency gains, and new capabilities. Developers, researchers, and organizations looking to quickly leverage these advances face a significant challenge: consistently and reliably evaluating model performance and safety, and determining which model best suits their use cases. To help address this need, Google DeepMind and Giskard are releasing LMEval, a large model evaluation framework, alongside the Phare Benchmark, an independent multilingual security and safety benchmark.
Toward Secure & Trustworthy AI: Independent Benchmarking
Available Media | Slides (PDF), Slides (Online)
Conference | InCyber Forum 2025
Author | Elie Bursztein