Machine Learning 101

By Abby Smith on February 10, 2020

When I joined Upstream Tech with a background in landscape management and resilient food systems, machine learning was mostly a black box to me. Over the past two years, I’ve come to better understand the incredible power of machine learning by working closely with our engineering team focused on the agricultural space. This post shares the basics of what I’ve learned and how machine learning can help conservation organizations solve critical challenges in land use and resource management.

At Upstream Tech, we use satellite data and machine learning to help organizations better manage natural resources. As we shared in our Satellites 101 post, the availability of satellite data presents a tremendous opportunity for informing conservation decision-making. However, the frequency and abundance of data outpace our ability to manually analyze it. Machine learning offers the promise of making analysis of satellite data possible at scale with unprecedented efficiency. The rapid insights afforded by these innovations can inform up-to-date decisions about how best to conserve resources and protect watersheds.

Let’s start by clarifying some terms. Artificial intelligence (AI) is a buzzy concept that refers to machines doing tasks we deem “smart.” Machine learning fits under the larger umbrella of AI and is fundamentally about training computers to learn from data. Where a typical computer program performs tasks by following rules, a machine learning system is able to reason and improve by picking up on patterns in data. Examples of machine learning are increasingly common in our daily lives, from filtering spam to understanding voice commands on your phone to recognizing people in images.

There are three main components of machine learning:

Sample data: Samples are observations or examples, and training data is a crucial component of machine learning models. Exposing a model to new data is what enables it to learn. Often this input data is manually classified — think of the data labeling tasks Google sometimes asks for, like “select all street signs.” The more data the model sees, the easier it is to find relevant patterns and accurately predict results.

Captcha asking a user for sample data

Features: Also called variables or attributes, features are the characteristics of the samples that the model uses to find patterns. Features may include things like a sender’s email address that triggers a spam classification or green pixels in an image of a street sign.

Algorithms: An algorithm is the set of equations the model uses to get to an output. You can think of an algorithm as a basic set of rules for how to solve a problem. The computer uses the data and its features to figure out the answer. The goal of machine learning is to predict results based on new data, so once we’ve trained a model using ground truth data, we can use it to interpret new information.
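
To make these three components concrete, here is a minimal Python sketch of a toy spam filter. Everything in it is an assumption for illustration: the sample values are made up, the single feature (the number of links in an email) is hypothetical, and the "algorithm" is the simplest one I could think of, a threshold chosen to make the fewest mistakes on the labeled samples. Real systems use many more samples, features, and far more capable algorithms.

```python
# Hypothetical labeled samples: each has one feature (the number of links in
# an email) and a ground truth label (True = spam). All values are made up.
samples = [
    {"links": 0, "spam": False},
    {"links": 1, "spam": False},
    {"links": 2, "spam": False},
    {"links": 7, "spam": True},
    {"links": 9, "spam": True},
    {"links": 12, "spam": True},
]


def count_errors(threshold, data):
    """Count samples misclassified by the rule 'spam if links > threshold'."""
    return sum((s["links"] > threshold) != s["spam"] for s in data)


def learn_threshold(data):
    """Try each observed feature value as a threshold and keep the best one."""
    candidates = sorted({s["links"] for s in data})
    return min(candidates, key=lambda t: count_errors(t, data))


# "Training": the rule comes from the data rather than being written by hand.
threshold = learn_threshold(samples)


def predict(links):
    """Apply the learned rule to a new, unlabeled email."""
    return links > threshold
```

Once the threshold has been learned, calling `predict(10)` classifies an email the model has never seen before, which is the whole point of training.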

At Upstream Tech, our goal is to save conservation organizations time and money by using machine learning to understand conditions on the ground and how they are changing over time. This was the motivation behind AgTrends™, a service which enables users to set informed baselines and efficiently track the adoption of agricultural management practices at a field and watershed scale.

We can help answer questions such as: are growers adopting conservation management practices such as cover cropping? To train our models, we need sample data, also known as ground truth data, on where and when cover cropping is happening. We show the model sample fields that have cover crops during a given time period and fields that do not. Given these labeled examples, the model can train itself to mimic them when it makes new predictions.

An overview of how satellites work with ML

So let’s take cover cropping as an example. Our machine learning approach for AgTrends™ includes five distinct steps:

1. We give the model information about vegetation presence on a farm field.

The time and location are important when we overlay this ground truth data with satellite data. Information about vegetation during certain periods, which we get from satellites, is a good indicator of whether cover cropping is happening. For example, vegetation growth (measured by the Normalized Difference Vegetation Index, or NDVI, in remote sensing lingo) in the spring leading up to the growing season can serve as a good indicator of cover crop emergence and is shown in the image below from the European Space Agency’s Sentinel-2 satellite.

Satellite imagery processed to view the vegetation index of an agricultural region
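
For readers curious about what the vegetation index actually is, the Normalized Difference Vegetation Index compares how much near-infrared (NIR) and red light a surface reflects: NDVI = (NIR - Red) / (NIR + Red). Healthy vegetation reflects a lot of near-infrared light and absorbs red light, so it scores closer to 1. The sketch below computes it for single pixels. The reflectance values are made up, and returning 0.0 for an empty pixel is a choice made here for illustration; on Sentinel-2, the red and NIR measurements commonly used are bands B4 and B8.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        # Avoid dividing by zero on empty or masked pixels (a choice made
        # for this sketch, not a standard convention).
        return 0.0
    return (nir - red) / total


# Made-up reflectance values for illustration only.
dense_vegetation = ndvi(nir=0.5, red=0.1)  # high NIR, low red
bare_soil = ndvi(nir=0.3, red=0.2)         # NIR and red are closer together
```

With these made-up numbers, the vegetated pixel scores noticeably higher than the bare soil pixel, which is the kind of signal a model can pick up on.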

2. The model guesses whether cover cropping is present or absent on the field.

This is where the magic happens. The model takes in vegetation data, looks for features, and uses its algorithm to predict an output. Based on the vegetation presence and timing over the course of a growing season, the model generates an outcome — in this case guessing that a field is or is not using cover crops.

3. We tell the model whether it was correct or not.

We then determine whether or not the model produced a correct answer. If the model mistakenly classifies a field, we can inform the model of this incorrect output and it can learn to make better decisions in the future. This autonomous and dynamic ability to learn, predict, and improve is what makes machine learning so powerful.

4. The model adjusts a little bit to be more correct next time.

After repeating this process with many training examples, the model eventually learns how to take vegetation data for any field in a given year and confidently predict whether or not cover cropping occurred.
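
Steps 2 through 4 can be sketched in a few lines of Python. This is a toy under stated assumptions, not a description of the models behind AgTrends™: the training data is made up, each field is reduced to a single feature (its spring NDVI), and the learning rule is a classic perceptron-style update, chosen because it is short and easy to follow. The names `guess` and `train` are hypothetical.

```python
# Made-up training data: (spring NDVI for a field, 1 if cover cropped else 0).
training_fields = [
    (0.15, 0), (0.20, 0), (0.25, 0), (0.30, 0),
    (0.55, 1), (0.60, 1), (0.70, 1), (0.75, 1),
]


def guess(ndvi, weight, bias):
    """Step 2: guess 1 (cover crop present) or 0 (absent)."""
    return 1 if weight * ndvi + bias > 0 else 0


def train(fields, learning_rate=0.1, max_passes=1000):
    """Repeat guess, check, adjust until every training field is correct."""
    weight, bias = 0.0, 0.0
    for _ in range(max_passes):
        mistakes = 0
        for ndvi, label in fields:
            # Step 3: compare the guess to the ground truth (0 means correct).
            error = label - guess(ndvi, weight, bias)
            if error != 0:
                # Step 4: nudge the parameters toward the correct answer.
                weight += learning_rate * error * ndvi
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:
            break  # a full pass with no mistakes, so stop adjusting
    return weight, bias


weight, bias = train(training_fields)
```

After training, `guess(0.65, weight, bias)` applies the learned parameters to a field the model has not seen. Each individual adjustment is small, and it is the repetition over many examples that gets the model to a useful rule.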

5. We test the model’s performance.

The final step is evaluating the model’s performance using data it has not previously seen. In most cases, we use about 80% of ground truth data to train a model. Then, once we’ve trained the model, we run it on the remaining 20% of data that we held back from training to test how the model performs against real data. Once we have a model that performs well, we can use it to rapidly assess new areas of interest. Additionally, we can easily re-test and tune it with new ground truth data to apply it to a new region.
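
Here is a minimal sketch of that 80/20 split in Python. The ground truth values are made up, the function names are hypothetical, and the "trained model" is a stand-in (a fixed NDVI threshold) so that the example stays self-contained. Shuffling before splitting is an assumption on my part; it keeps the held-out fields from all coming from one end of the list.

```python
import random


def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples, then split into training and held-out sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


def accuracy(predict, labeled_samples):
    """Fraction of samples where the prediction matches the ground truth label."""
    correct = sum(predict(feature) == label for feature, label in labeled_samples)
    return correct / len(labeled_samples)


# Made-up ground truth: (spring NDVI, 1 if cover cropped else 0) for 10 fields.
ground_truth = [(0.1 * i, 1 if i >= 5 else 0) for i in range(10)]
train_set, test_set = split_train_test(ground_truth)


# Stand-in for a trained model: a fixed NDVI threshold (an assumption).
def threshold_model(ndvi):
    return 1 if ndvi >= 0.45 else 0


# Score the model only on the fields that were held back from training.
test_accuracy = accuracy(threshold_model, test_set)
```

The key design choice is that `test_set` is never used during training, so `test_accuracy` reflects how the model handles fields it has not seen.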

Analysis workflow from data to insight

Given the time and resources that are necessary for farm-level surveying or field sensors, machine learning is well-positioned to alleviate some of the burden of manual data collection. We’ve used this approach to develop machine learning models for a range of practices. So far, our models can detect management practices such as cover cropping, tillage, riparian buffers, irrigation intensity, and others shown below:

Examples of what can be detected with ML and satellites

By reducing the cost of agricultural assessments and ongoing monitoring, we want to help organizations allocate more resources to conservation adoption programs that support growers and protect watersheds — because at the end of the day, our goal is always more conservation, done better and more efficiently, at scale. Check out our website and reach out to learn more about how AgTrends™ could empower your work.