<aside>

Principal Authors: Michael Minkoff, Sivaramakrishnan Balasubramanian, Nupur Mishra, Kaustav Ghosal

</aside>

Acknowledgement

The authors gratefully acknowledge the generous time and insights shared by Jagadish Babu (COO, EkStep Foundation), JC Tripathi (Director, Agriculture, Wadhwani AI), Kumar Rajamani (Associate Director, CropIn), Praveen Pankajakshan (Chief AI Scientist, Urban Kisaan; Apex Committee member overseeing India's AI Centres of Excellence; Advisory Committee Member, Harvard Data Science Initiative), Parameswaran Iyer, and Nikhil Toshniwal (AgRevolution / DeHaat). Their consultations, reviews, and comments substantially enriched the clarity and scope of this discussion paper. We also acknowledge the contributions of several additional experts who offered guidance and commentary but preferred to remain unnamed. Finally, sincere thanks to the DevGlobal and DevAfrique teams for their thoughtful review, comments, and support on final production.

Glossary

| Abbreviation | Full Form/Definition |
| --- | --- |
| ACRE Africa | Agriculture and Climate Risk Enterprise – Provides weather-index-based insurance to smallholder farmers in Africa. |
| AI | Artificial Intelligence – Systems that simulate human intelligence, used in agriculture for personalized advice, diagnostics, and predictions. |
| API | Application Programming Interface – A set of tools and protocols that allow different software systems to communicate and exchange data. |
| APS | American Phytopathological Society – Referenced here in context of image databases used for pest and disease identification. |
| CARE | Collective Benefit, Authority to Control, Responsibility, and Ethics – Ethical principles for data governance, especially for Indigenous and marginalized communities. |
| CGIAR | Consultative Group on International Agricultural Research – A global partnership of research institutions focused on food security and agriculture. |
| CRM | Customer Relationship Management – Systems used to manage interactions and data related to users (e.g., farmers). |
| DAAS | Digital Agriculture Advisory Services – Part of India’s AgriStack; it aims to serve as a federated data utility. |
| DBT | Direct Benefit Transfer – A system used in India to transfer subsidies or payments directly to beneficiaries’ bank accounts. |
| ETL | Economic Threshold Level – The pest population level at which control measures should be implemented to prevent crop loss. |
| FAIR | Findable, Accessible, Interoperable, Reusable – A set of principles that guide responsible data management and sharing. |
| FAO | Food and Agriculture Organization of the United Nations – An international agency working on hunger eradication and sustainable agriculture. |
| FCDO | Foreign, Commonwealth & Development Office (UK) – Cited for its evaluation of GODAN. |
| FPO | Farmer Producer Organization – A farmer-owned group that improves access to inputs, services, and markets. |
| GARDIAN | Global Agricultural Research Data Innovation & Acceleration Network – CGIAR’s platform for open agricultural data. |
| GODAN | Global Open Data for Agriculture and Nutrition – An initiative promoting open data for agriculture and nutrition. |
| ICAR | Indian Council of Agricultural Research – Apex body in India for agricultural education and research. |
| IDH | The Sustainable Trade Initiative – An organization supporting sustainable trade through partnerships and data sharing. |
| ILRI | International Livestock Research Institute – Research center focusing on livestock in developing countries. |
| INRIA | French Institute for Research in Computer Science and Automation – Referenced for its work on agriculture and digital technology. |
| IoT | Internet of Things – Network of connected physical devices that collect and exchange data (e.g., farm sensors). |
| IPCC SRCCL | Intergovernmental Panel on Climate Change – Special Report on Climate Change and Land – A global scientific report on climate risks to land systems. |
| IVR | Interactive Voice Response – Automated phone system used for delivering agricultural advisories in local languages. |
| KYC | Know Your Customer – A process to verify the identity of users; used in linking farmers to financial services. |
| LMICs | Low- and Middle-Income Countries – Countries with lower income levels, where most smallholder farmers are located. |
| MIS | Management Information System – A tool or platform used to store and manage data, such as livestock health records. |
| NASA POWER | Prediction of Worldwide Energy Resources – A NASA dataset useful for weather-based agricultural planning. |
| NGO | Non-Governmental Organization – Non-profit groups that often provide agricultural training, services, and data. |
| NOAA | National Oceanic and Atmospheric Administration – Provides climate and weather data, used in agriculture. |
| OD4D | Open Data for Development – A global partnership to support open data initiatives. |
| PMFBY | Pradhan Mantri Fasal Bima Yojana – India’s flagship crop insurance scheme. |
| PM-KISAN | Pradhan Mantri Kisan Samman Nidhi – India’s income support scheme for farmers. |
| PoC | Proof of Concept – A small-scale pilot project used to test the feasibility of an idea before wider rollout. |
| PxD | Precision Development – A nonprofit organization delivering personalized agricultural advice via mobile platforms. |
| RAG | Retrieval-Augmented Generation – AI technique where a model retrieves documents before generating a response, improving accuracy. |
| SSA | Sub-Saharan Africa – A region of Africa south of the Sahara Desert. |
| UFSI | Unified Farmer Service Interface – A digital switchboard under AgriStack that connects databases across government services. |
| VISTAAR | Virtually Integrated System to Access Agricultural Resources – A federated, AI-powered digital advisory platform launched by India’s Ministry of Agriculture. |

Building a Shared Data Corpus for Equitable Agricultural AI

1. Introduction

Farmers worldwide face increasingly complex challenges, and smallholders are particularly affected by climate variability, pest and disease outbreaks, soil degradation, volatile markets, and the growing demand to produce food sustainably. Approximately 733 million people faced hunger in 2023 alone, underscoring the fragility of current agricultural systems and their limited capacity to meet global nutritional needs (FAO et al., 2024). While traditional agricultural knowledge remains vital, it often falls short in addressing the interconnected nature of today’s risks (Thudumu & Fisher, 2025). The COVID-19 pandemic further revealed systemic vulnerabilities in food systems, reinforcing the urgency for more resilient and inclusive agricultural frameworks (IARJSET, 2024; IPCC SRCCL Chapter 5, 2023).

Responding to these pressures requires a transformation toward more resilient, inclusive, and data-informed agricultural systems. In this context, technological innovations are playing a growing role in equipping farmers with tools to manage complexity and uncertainty. Among these innovations, Artificial Intelligence (AI) stands out for its ability to analyze large and varied datasets and generate tailored, location-specific advisories (Asolo et al., 2024). By integrating information on weather, pests, soils, and markets, AI supports better farm-level decisions while also advancing broader goals of climate resilience, resource efficiency, and equitable access to timely agricultural knowledge (Mana et al., 2024; Umar, 2023). AI applications such as precision farming and data-driven advisory systems contribute to these goals by giving marginalized smallholder farmers access to insights and resources that were previously out of reach. By narrowing information gaps, these tools enable farmers to optimize resource use, make informed decisions, and ultimately improve their productivity and livelihoods (Gikunda, 2024).

However, the potential of AI in agriculture depends heavily on overcoming structural barriers such as climate stress, resource degradation, weak and fragmented infrastructure, and limited access to digital tools (Mana et al., 2024). Scaling AI solutions for a resilient agricultural future will require integrated data systems, affordable digital services, and robust financial support through targeted credit, insurance, incentives, and subsidy schemes to ensure widespread adoption and sustained impact. Effective agricultural AI also depends on trustworthy, curated, geo-tagged, multi-seasonal data with minimal noise, which means prioritizing data quality over data quantity.

Against this backdrop, this discussion paper explores the current landscape of agricultural data corpora, and the opportunities to enhance them, for inclusive, equitable, and impactful AI-driven advisory systems. The paper draws on use cases from India and Sub-Saharan Africa (SSA), a literature review, and a limited set of expert consultations. It highlights learnings from existing solutions and initiatives and recommends areas of focus for future implementation, investments, policies, and discussions. It considers the broader set of stakeholders involved in effective corpora building and governance, including policymakers, academic institutions, financial providers, government agencies, and private sector actors.

2. Context and Problem

2.1 Why Corpus Quality Matters for Agricultural AI

Central to enhancing the promise of AI-powered advisory solutions for small-scale agricultural producers is the development of robust, adaptive, context-aware, and unbiased federated data corpora. These corpora must integrate high-quality datasets reflecting local agricultural realities, ensuring location-specific relevance while preserving privacy through trusted intermediaries, clear governance frameworks, consent management, and privacy-preserving technologies. Federated data systems enable decentralized data sharing across multiple actors without requiring central data pooling, thereby safeguarding privacy and encouraging participation (Kairouz et al., 2021). Building agricultural data corpora ideally involves inclusive collaboration across the agri-food system, from farmers and producer organizations to researchers, digital innovators, agronomists and policymakers. Such multi-stakeholder engagement ensures that corpora-building efforts consider, and seek to reflect, diverse agro-ecological contexts and farming practices, foster trust, embed co-design, and facilitate development of context-tailored AI tools (INRIA, 2022).
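The idea of sharing without central pooling can be made concrete with a small illustration. The following is a minimal sketch, not a description of any system cited in this paper: each participant fits a simple model on its own records and shares only the fitted parameters and a record count, and a coordinator combines them with a sample-weighted average (a single round in the style of federated averaging). The names `LocalUpdate`, `fit_local`, and `aggregate`, the yield-versus-rainfall model, and all data values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LocalUpdate:
    """What a participant shares: fitted parameters and a count, never raw rows."""
    slope: float
    intercept: float
    n_records: int


def fit_local(rainfall_mm, yield_t_ha):
    """Least-squares fit of yield = slope * rainfall + intercept, run on-site."""
    n = len(rainfall_mm)
    mean_x = sum(rainfall_mm) / n
    mean_y = sum(yield_t_ha) / n
    sxx = sum((x - mean_x) ** 2 for x in rainfall_mm)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(rainfall_mm, yield_t_ha))
    slope = sxy / sxx
    return LocalUpdate(slope, mean_y - slope * mean_x, n)


def aggregate(updates):
    """Coordinator step: sample-weighted average of the local parameters."""
    total = sum(u.n_records for u in updates)
    slope = sum(u.slope * u.n_records for u in updates) / total
    intercept = sum(u.intercept * u.n_records for u in updates) / total
    return slope, intercept, total


# Hypothetical records held separately by two organizations.
update_a = fit_local([100, 200, 300], [1.0, 2.0, 3.0])
update_b = fit_local([100, 300], [2.0, 4.0])

# Only the two LocalUpdate objects cross organizational boundaries.
print(aggregate([update_a, update_b]))
```

Real federated deployments typically add multiple training rounds, secure aggregation, and consent checks; this sketch only shows that a shared model can be built from parameters rather than pooled raw data.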

Data quality, relevance, and interpretability improve significantly when farmers, extension workers, co-operatives, farmer support organizations, and local institutions are involved not only as data providers but also as co-designers and as sources of ongoing feedback (Eastwood et al., 2019; Bronson, 2019). This participatory approach also strengthens trust, ownership, and inclusivity, addressing concerns about extractive data collection and use and about the opaque algorithms behind AI advisory outputs. Importantly, interoperability standards such as open APIs and shared metadata protocols are key enablers of a federated approach, allowing diverse systems to exchange and interpret data in meaningful ways (FAO & ITU, 2022). Furthermore, adaptive governance frameworks anchored in fairness, transparency, and equitable benefit sharing help ensure that federated agricultural data corpora remain responsive to evolving climate conditions, policy shifts, and market trends (Stringer et al., 2020). Ultimately, such systems not only improve the precision of AI-driven advisories but also enhance their practical applicability and long-term sustainability in real-world farm settings.
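As an illustration of what a shared metadata protocol might look like in practice, the sketch below defines a small dataset description that one system serializes and another validates before accepting the data. It is an assumption-laden example, not an existing standard: the field names, the `DatasetMetadata` class, the `validate` function, and the sample values are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical minimum set of fields every participating system agrees to provide.
REQUIRED_FIELDS = (
    "dataset_id", "provider", "crop", "region",
    "season", "license", "consent_obtained",
)


@dataclass(frozen=True)
class DatasetMetadata:
    dataset_id: str
    provider: str
    crop: str
    region: str
    season: str
    license: str
    consent_obtained: bool


def validate(record):
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in record]
    # Consent is checked separately so a present-but-false flag is reported clearly.
    if "consent_obtained" in record and record["consent_obtained"] is not True:
        problems.append("consent_obtained must be true")
    return problems


# Sender side: describe a dataset and serialize it (all values are made up).
meta = DatasetMetadata(
    dataset_id="ds-001", provider="Example FPO", crop="maize",
    region="Kitui", season="2024-long-rains", license="CC-BY-4.0",
    consent_obtained=True,
)
payload = json.dumps(asdict(meta))

# Receiver side: parse the payload and check it before ingesting the data.
print(validate(json.loads(payload)))
```

Agreeing on even a small required field set such as this lets independently built systems decide whether a dataset is usable without first inspecting its contents.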

The Role of Localized Data. Maize farmers in drought-prone Karnataka require vastly different advice from their counterparts in humid Bihar, where fertile alluvial soils increase pest threats. Similarly, rice growers in West Bengal’s saline coastal belt face unique challenges compared to wheat farmers in waterlogged Punjab. In Kenya, maize farmers in Kitui dealing with drought conditions have distinct needs compared to those in Trans-Nzoia, where high humidity and fertile soils increase susceptibility to fungal diseases. For AI tools to genuinely benefit farmers, the underlying data must therefore reflect these diverse agronomic conditions in depth (Aroba & Rudolph, 2024; Munyao, 2024).

2.2 Layered Data for Accurate, Inclusive, Trustworthy Advisory

Developing a strong federated data corpus requires thoughtfully integrating both globally trusted agricultural knowledge and locally contextualized insights. Many foundational elements, such as crop growth stages, pest and disease life cycles, soil health parameters, nutrient management protocols, integrated pest management (IPM) strategies, and climate-resilient farming practices, can be drawn from global best practices (CGIAR, 2021; World Bank, 2022). These can then be enriched by local inputs, including vernacular content (for example, through the Bhashini initiative launched by the Government of India), traditional practices, region-specific cropping calendars, and microclimatic data, which fine-tune AI models to reflect real-world farming conditions.

Critically, an expert consultation with an AI specialist working closely with government and private sector stakeholders highlighted the importance of viewing datasets as layers in the hierarchy Observation → Data → Information → Knowledge. Raw observations are structured into data, data are analyzed to generate meaningful information, and information is consolidated into actionable knowledge. This layering allows curated knowledge to be stored, retrieved, and continuously enriched by new insights, with ongoing human oversight (human-in-the-AI-loop) to ensure accuracy, trustworthiness, and local relevance.
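The layering described above can be sketched as a short pipeline. This is an illustrative example under stated assumptions rather than the consulted expert's design: the raw note format, the pest-count threshold of 4, the advisory wording, and every function and class name are hypothetical. The point it shows is that each layer is derived from the one below it, and that a knowledge item stays unapproved until a human reviewer signs off.

```python
from dataclasses import dataclass
from statistics import mean

# Observation layer: raw field notes as they might arrive (hypothetical format).
RAW_OBSERVATIONS = [
    "plot-7;2024-07-01;pests_per_plant=3",
    "plot-7;2024-07-02;pests_per_plant=5",
    "plot-7;2024-07-03;pests_per_plant=7",
]


@dataclass
class Record:
    plot: str
    date: str
    pests_per_plant: int


def to_data(raw):
    """Data layer: parse one raw note into a structured record."""
    plot, date, reading = raw.split(";")
    return Record(plot, date, int(reading.split("=")[1]))


def to_information(records, threshold):
    """Information layer: summarize records against an assumed threshold."""
    avg = mean(r.pests_per_plant for r in records)
    return {"plot": records[0].plot,
            "avg_pests_per_plant": avg,
            "above_threshold": avg >= threshold}


@dataclass
class KnowledgeItem:
    advisory: str
    approved: bool = False  # stays False until a human expert signs off


def to_knowledge(info):
    """Knowledge layer: draft an advisory that still needs human review."""
    if info["above_threshold"]:
        text = f"{info['plot']}: pest pressure above threshold; consider control measures."
    else:
        text = f"{info['plot']}: pest pressure below threshold; continue monitoring."
    return KnowledgeItem(text)


def approve(item, reviewer):
    """Human oversight step: record who reviewed the advisory and mark it approved."""
    return KnowledgeItem(f"{item.advisory} (reviewed by {reviewer})", approved=True)


records = [to_data(raw) for raw in RAW_OBSERVATIONS]
info = to_information(records, threshold=4)  # threshold value is illustrative only
draft = to_knowledge(info)
final = approve(draft, reviewer="extension agronomist")
print(draft.approved, final.approved)
```

Keeping the layers as separate steps means new observations can be appended and the higher layers regenerated, while the approval flag makes the human review step explicit rather than implied.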