Sample Earth: Machine-Learning–Ready Land-Cover Reference Dataset

This dataset is part of the Sample Earth initiative, a global effort to build open, high-quality reference data for improving the accuracy and inclusiveness of land-cover maps. It contains GPS-located land-cover samples that can be used to train and validate AI models that generate detailed, accurate maps, with a focus on coffee and cocoa production systems.

The data were collected across Vietnam and Ghana, combining expert interpretation of high-resolution satellite imagery (Google Earth, Planet) with a smaller subset of ground-truth observations. Each point is labeled and quality-controlled to represent a diverse range of land-cover types commonly found within and around smallholder production areas. The classification scheme includes 10 main classes (such as coffee, cocoa, orchard, natural forests) and 68 sub-classes (such as full sun coffee, coffee intercropped with black pepper,

While the primary goal is to distinguish coffee and cocoa systems from other land uses, the dataset also supports broader applications such as agricultural monitoring, deforestation analysis, ecosystem service mapping, land-use planning, and suitability modeling.

By providing transparent, well-validated training data, this dataset contributes to Sample Earth’s broader objective: strengthening AI-based land monitoring tools and supporting global efforts, including the EU Deforestation Regulation (EUDR), to ensure sustainable, deforestation-free agricultural supply chains.

The dataset is designed to grow continuously, incorporating new commodities, timeframes, and countries over time.

Methodology: The dataset was developed primarily through expert visual interpretation of high-resolution satellite imagery from Google Earth and Planet, collected between 2019 and 2022. A smaller subset of points in the Central Highlands of Vietnam was derived from field observations, providing additional ground-truth validation.

To enhance interpreter accuracy and contextual understanding, field visits and Google Street View assessments were conducted in both Vietnam and Ghana. These activities helped experts better recognize local land-use patterns and distinguish among different crop and landscape types.

All sample points were digitized and standardized using QGIS, with attributes including class ID, crop type, sampling date, and associated metadata to ensure consistency and interoperability.
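As a rough illustration of how the attributes listed above might be read and checked before use, the following Python sketch parses point records from CSV text. The column names (`lon`, `lat`, `class_id`, `crop_type`, `sampling_date`), the CSV layout, and the two example rows are assumptions made for this sketch; the actual field names and file format of the dataset may differ.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class SamplePoint:
    """One labeled land-cover point (field names are hypothetical)."""
    lon: float
    lat: float
    class_id: int
    crop_type: str
    sampling_date: date


def parse_points(csv_text: str) -> List[SamplePoint]:
    """Parse CSV text into SamplePoint records, rejecting invalid coordinates."""
    points = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lon, lat = float(row["lon"]), float(row["lat"])
        # Basic sanity check on WGS84 longitude/latitude ranges.
        if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
            raise ValueError(f"coordinates out of range: {lon}, {lat}")
        points.append(
            SamplePoint(
                lon=lon,
                lat=lat,
                class_id=int(row["class_id"]),
                crop_type=row["crop_type"].strip(),
                # Assumes ISO 8601 dates (YYYY-MM-DD).
                sampling_date=date.fromisoformat(row["sampling_date"]),
            )
        )
    return points


# Made-up rows for illustration only; these are not records from the dataset.
EXAMPLE_CSV = """lon,lat,class_id,crop_type,sampling_date
108.05,12.67,1,coffee,2021-03-15
-1.62,6.69,2,cocoa,2020-11-02
"""

if __name__ == "__main__":
    for point in parse_points(EXAMPLE_CSV):
        print(point)
```

Validating coordinates and dates at load time is one simple way to catch malformed rows before they reach a training pipeline.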

This combined approach of expert interpretation, localized training, and structured data management ensured a high-quality, consistent, and machine-learning–ready dataset suitable for land-cover mapping and model training workflows.
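For model training workflows, one common first step is to divide labeled points into training and validation sets while keeping every class represented. The sketch below shows a simple stratified split using only the Python standard library. The function name `stratified_split`, the 20% validation fraction, and the example labels are assumptions for illustration, not part of the dataset or its documented workflow.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable],
    val_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, val_indices), split separately within each class."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for label in sorted(by_class, key=str):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        # Always keep at least one sample of each class for training.
        n_val = min(n_val, len(indices) - 1)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


if __name__ == "__main__":
    # Hypothetical labels for illustration only.
    example_labels = ["coffee"] * 10 + ["cocoa"] * 5 + ["natural forest"] * 1
    train_idx, val_idx = stratified_split(example_labels, val_fraction=0.2)
    print(len(train_idx), "training points,", len(val_idx), "validation points")
```

A purely random split like this does not account for spatial proximity: points that lie close together can end up on both sides of the split and inflate validation scores, so a spatially blocked split may be preferable in practice.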

Vantalon, T.; Luong, P.T.; Perez Escobar, J.A.; Tello Dagua, J.J.; Phan, T.V.; Nguyen, H.; Hong Nguyen; Hoa Nguyen; Reymondin, L.
