Principal Authors: Michael Minkoff, Sivaramakrishnan Balasubramanian, Nupur Mishra, Kaustav Ghosal
What if a smallholder farmer in rural Kenya, southern Nigeria, or Andhra Pradesh, India could confidently say: “I have the best AI advisor available in my area”? What if policymakers and practitioners had tools to reliably compare AI advisory services based on trusted benchmarks and field validation?
These are no longer hypothetical questions. With the rise of generative AI (GenAI), advisory tools powered by large language models (LLMs) and retrieval-augmented generation (RAG) are being deployed across agricultural systems to support decision-making, from pest management to climate-adaptive practices. Yet, alongside this expansion comes a pressing challenge: there is still no widely accepted way to assess how accurate, contextually relevant, or trustworthy these tools are, especially for small-scale producers in linguistically and agronomically diverse regions like Sub-Saharan Africa and India.
As AI-powered advisory tools become increasingly common in agricultural extension, a critical question emerges: how do we know which tools actually work for small-scale producers (SSPs), and which work best for specific use cases based on geography, language, crop/pest type, and similar factors?
This paper explores the emerging landscape of GenAI benchmarking for agriculture, with a focus on tools and frameworks designed to assess LLM-, RAG-, and related multi-modal advisors. It characterizes current benchmarks and evaluation approaches, assesses their limitations and relevance, and synthesizes key dimensions for meaningful validation: accuracy, contextual relevance, trust, and responsibility. Drawing on global research, practical field-based implementation, and stakeholder workshops, the paper outlines practical recommendations to help donors, developers, and governments improve benchmarking design, uptake, and institutional coordination.
Benchmarking, when done right, can help ensure GenAI not only works in the lab but serves farmers in the field—responsibly, equitably, and with confidence.
Agricultural extension, the provision of research-backed agronomic and crop/pest management knowledge and information to farmers through capacity building and education, has long been crucial for supporting the world’s 570 million small-scale farmers. Traditional extension systems, however, struggle to reach farmers at scale: public extension agent-to-farmer ratios often range from 1:500 up to 1:5,000, far exceeding the recommended 1:100.
Over recent decades, various digital advisory tools have emerged to bridge this gap. Early efforts included radio programs, call centers, SMS push alerts, and interactive voice response (IVR) systems. Even so, many smallholders, especially women and marginalized groups, remained excluded due to digital illiteracy or inappropriate content (Digital Green 2025).
The emergence of GenAI offers the potential to transform digital agricultural advisory services and to overcome the limitations of earlier digital approaches. LLM-, small language model (SLM)-, and RAG-powered advisory tools can provide personalized, contextually relevant advice, addressing the information gaps faced by smallholder farmers and offering a pathway beyond traditional extension services and prior digital agriculture interventions that often struggle with scalability, timely delivery, and sustainability, especially in remote areas (Singh et al., 2024; Didwania et al., 2024).
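To make the approach concrete, below is a minimal sketch of a RAG-style advisory pipeline. The corpus entries, the `retrieve()` scoring, and the `llm_complete()` model client are all hypothetical placeholders for illustration, not any specific deployed system.

```python
# Tiny in-memory "knowledge base" of localized agronomic advisories.
CORPUS = [
    {"region": "Kenya", "crop": "maize",
     "text": "Scout maize for fall armyworm twice weekly; handpick egg masses early."},
    {"region": "Andhra Pradesh", "crop": "chili",
     "text": "Use yellow sticky traps to monitor thrips in chili before spraying."},
]

def retrieve(query: str, region: str, k: int = 2) -> list[str]:
    """Rank corpus entries by keyword overlap, preferring the farmer's region."""
    terms = set(query.lower().split())
    def score(doc: dict) -> int:
        overlap = len(terms & set(doc["text"].lower().split()))
        return overlap + (2 if doc["region"] == region else 0)
    ranked = sorted(CORPUS, key=score, reverse=True)
    return [doc["text"] for doc in ranked[:k]]

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an LLM or SLM; swap in a real model client."""
    return "DRAFT ADVICE (model output would appear here)"

def advise(question: str, region: str) -> str:
    """Ground the model's answer in retrieved, region-specific context."""
    context = "\n".join(retrieve(question, region))
    prompt = (
        "Answer the farmer's question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm_complete(prompt)

print(advise("How do I manage armyworm in my maize?", region="Kenya"))
```

The key design choice in RAG is that the model’s answer is grounded in retrieved, locally curated content rather than in the model’s general training data alone, which is precisely what makes localized validation both possible and necessary.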
However, if benchmarking is to help these tools overcome the known limitations of earlier approaches, benchmarking solutions must be designed cognizant of 1) the key risks that LLM-powered agri-advisory tools introduce; and 2) the primary, and critical, dimensions that define strong performance, notably accuracy, local relevance, user trust, and safety.
In Section 2.2, we dive into these four critical dimensions (accuracy, local relevance, user trust, and safety): each addresses a key question for donors, implementers, and farmers themselves, and together they define what “best” means for an AI farming advisor in practice (Chang et al. 2023; Liang et al. 2022).
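As a rough illustration of how these four dimensions might be operationalized in a benchmark, the sketch below scores a single advisory response on each dimension and combines them into a composite. The 0-to-1 scale and the weights are hypothetical choices made for illustration, not a published standard.

```python
from dataclasses import dataclass

@dataclass
class DimensionScores:
    accuracy: float          # agronomic correctness vs. an expert reference
    local_relevance: float   # fit to region, language, crop, and season
    user_trust: float        # e.g., farmer-rated clarity and credibility
    safety: float            # absence of harmful or unverified advice

# Hypothetical weights; a real framework would set these with stakeholders.
WEIGHTS = {"accuracy": 0.35, "local_relevance": 0.25,
           "user_trust": 0.20, "safety": 0.20}

def composite(scores: DimensionScores) -> float:
    """Weighted mean; a real benchmark might hard-gate on safety instead."""
    return sum(w * getattr(scores, dim) for dim, w in WEIGHTS.items())

example = DimensionScores(accuracy=0.8, local_relevance=0.6,
                          user_trust=0.7, safety=0.9)
print(f"Composite score: {composite(example):.2f}")  # -> 0.75
```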
This section discusses how five key risks interact with and influence benchmarking efforts: linguistic and cultural mismatch; contextual inaccuracy and irrelevance; unverified or unsafe recommendations; data governance and privacy concerns; and lack of consumer protection and liability. We present these risks because understanding these challenges in LLM-powered agri-advisory tools informs what benchmarking efforts must address and flag, ultimately helping us know which tools actually work for small-scale producers, and which work best for specific use cases based on geography, language, and crop/pest type.
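As a simplified illustration of what “flagging” could look like in a benchmark harness, the sketch below checks a single response against a few of the risk categories above. The rules, including the `RESTRICTED_INPUTS` list, are illustrative stand-ins for checks that would need expert curation and jurisdiction-specific grounding.

```python
# Illustrative restricted-input list; a real one would be expert-curated
# and jurisdiction-specific.
RESTRICTED_INPUTS = {"monocrotophos", "endosulfan"}

def flag_risks(response: str, expected_language: str,
               detected_language: str) -> list[str]:
    """Return the risk categories a response triggers, for human review."""
    flags = []
    if detected_language != expected_language:
        flags.append("linguistic_mismatch")
    if any(p in response.lower() for p in RESTRICTED_INPUTS):
        flags.append("unsafe_recommendation")
    if "dosage" in response.lower() and "label" not in response.lower():
        flags.append("unverified_dosage_advice")  # crude heuristic
    return flags

print(flag_risks("Apply monocrotophos at double dosage.",
                 expected_language="sw", detected_language="en"))
# -> ['linguistic_mismatch', 'unsafe_recommendation', 'unverified_dosage_advice']
```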