Impressed with ChatGPT's agricultural knowledge? CGIAR's open access effort (probably)* enabled it.
Featured image: “A group of researchers with diverse background amazed at chatgpt response on computer screen in a digital art style” (Generated by DALL-E, March 10, 2023)
By Jawoo Koo (IFPRI), Medha Devare (IITA), Brian King (Alliance Bioversity-CIAT).
2023 will be remembered as the year when the communication barrier between human and machine intelligence was breached, as online chatbots such as ChatGPT went public, able to respond intelligibly to questions and even perform language-based tasks. In the near future, these large-scale machine learning models could have an impact on agri-food systems, for example by providing individually tailored advice to smallholder farmers.
One of the challenges for machine learning models is ensuring their factual accuracy and relevance, especially if vulnerable food producers are to take advantage of these tools. As these models are trained on available information, open access documents and open data are crucial for their potential to be realized.
*When asked whether OpenAI used CGIAR’s open access publications to train ChatGPT, the chatbot declined to answer specifically, responding that “OpenAI has not specifically disclosed whether or not they used data from CGIAR’s open access documents for training GPT-3.” However, it added that “OpenAI uses diverse and representative data to train their language models, including ChatGPT, to ensure that the models do not exhibit bias or perpetuate stereotypes.”
CGIAR’s open access efforts
In 2016, as the CGIAR Platform for Big Data in Agriculture (Big Data) was being created, we wanted to illustrate what the future would look like once all of CGIAR’s research outputs, both publications and data, were open access. One of the ideas was a chatbot, aptly named Cigi, trained specifically to answer farmers’ questions and help with their daily farming decisions (Figure 1). Over the next five years (2017-2021), we did not make the chatbot a reality, but we worked with the knowledge management teams of all CGIAR Centers to mainstream the FAIR (Findable, Accessible, Interoperable, Reusable) principles in our research and to make about 70% of our documents and digital assets open access (Figure 2).
Figure 1. A hypothetical use-case to show a chatbot application to help farmers’ decision-making in the field (Source: Koo, 2016, presented at the GCARD3 Global Event)
Figure 2. Trends of the access rights of CGIAR’s documents and digital assets over time (Source: GARDIAN, accessed 4 March 2023)
Why open access matters (even more now)
Fast forward to 2023, and we are now excited to see Cigi is here! Almost. Language models like GPT-3 power many chatbot applications and already generate human-like responses to a variety of questions. These models use sophisticated machine learning algorithms that analyze vast amounts of data on the internet to learn patterns and generate responses.
However, model-generated responses are only as good as the data the models are trained on. Here are three key reasons why open access documents and open data are critical, especially for providing relevant, accurate, and evidence-based responses on agrifood systems in the global South.
- More diversity: Research shows that most agricultural research publications do not provide solutions relevant to smallholder farmers. Models trained on literature from commercial-scale farming systems are unlikely to be useful in answering questions from small-scale producers. More data and evidence from open access documents covering a range of farming systems will improve the diversity of training data that language models learn from and help agri-food system actors address specific challenges.
- Reduced bias: While challenges faced by food, land, and water systems are much analyzed and discussed, the solutions and associated risks are not fully understood. Even for seasoned scientists, research findings are often nuanced and difficult to generalize in a simple question-and-answer session. Clearly written research findings and lessons-learned stories will help reduce biases in language models and their responses.
- Transparency: If inaccurate or biased responses are found, researchers can analyze the underlying data and documents used to train the model to identify the source of the problem, provide more training data, and further fine-tune the model. This can help improve the accuracy and reliability of large language models over time, making the models and applications more trustworthy and useful for a wider range of questions.
A call to action
Open access documents and open data that are findable, accessible, interoperable, and reusable are essential resources for improving the accuracy and effectiveness of large language models (like GPT-3) and applications (like ChatGPT). Future advances may enable large language models to provide more customized, location-specific recommendations for farmers. Such advice will need to draw heavily on data and combine complex analyses and data processing, enabling, for example, advisories on fertilizer rates based on weather and soil data at a particular location.
Achieving such goals will require not just open publications, but also open and FAIR data, allowing machines to source the correct quantitative information from appropriate databases. We, the researchers, are critical “humans in the loop” for making responsible use of large language models (LLMs). We can contribute to these goals by continuing to make our agrifood systems research findings FAIR, for machines as much as for humans. Just being “open” is no longer enough: we also need to communicate research findings in ways that are more standardized (for machine interpretability and interoperability) and reproducible.
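To make "FAIR for machines" a little more concrete, here is a minimal sketch of how a script might flag gaps in a dataset's metadata record before it is published. The field names, the mapping of fields to FAIR principles, and the example record are all hypothetical illustrations, not a CGIAR or GARDIAN schema; real metadata standards are considerably richer.

```python
# Hypothetical minimal field set per FAIR principle (an assumption for
# illustration only; not an official schema).
REQUIRED_FIELDS = {
    "findable": ["identifier", "title", "keywords"],
    "accessible": ["access_url"],
    "interoperable": ["format"],
    "reusable": ["license"],
}


def missing_fair_fields(record):
    """Return, per FAIR principle, the required fields that are absent or empty."""
    gaps = {}
    for principle, fields in REQUIRED_FIELDS.items():
        missing = [name for name in fields if not record.get(name)]
        if missing:
            gaps[principle] = missing
    return gaps


# Illustrative record; every value here is invented for the example.
example_record = {
    "identifier": "doi:10.0000/example",
    "title": "Maize yield trial, example site",
    "keywords": ["maize", "yield"],
    "access_url": "https://example.org/dataset.csv",
    "format": "text/csv",
    "license": "",  # left empty: reuse terms are not stated
}

print(missing_fair_fields(example_record))  # {'reusable': ['license']}
```

A check like this only confirms that fields are present, not that their contents are correct or standardized, so it complements rather than replaces curation by the researchers who produced the data.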
Without commitment and concerted efforts toward open and FAIR data, AI-based answers to our complex challenges will remain biased toward a small set of narrowly defined or highly generalized solutions that do not serve the smallholder, and we will miss an opportunity to pursue systematic approaches toward a more inclusive and sustainable future.