What Is a Data Science Problem?

(And What Makes It Good?)

I recently co-led a session at my district’s professional development event that gathered math teachers from all levels and presented them with “good” math problems. We defined the problems as good because we enjoyed them, they made us think, they were challenging, and they were illuminating.

Many of the problems we selected came from the BU-sponsored PROMYS for Teachers program. This program, an extension of the highly regarded PROMYS program for high school students, introduces teachers to problems in Number Theory using an exploratory, inquiry-based approach. I highly recommend it to any math teachers looking to enrich their mathematical comprehension, appreciation, and depth of knowledge.

The feedback I got from this session was positive, mainly because of the problems themselves. They were cool, different, and interesting to puzzle over. They were satisfying without being exclusionary. There were often many, many ways to approach and solve each one. For example, my wonderful colleague Sarah Deitch, who co-led the session with me, brought a problem set that built the explicit formula for the Fibonacci Sequence that absolutely knocked my socks off. It had all the narrative intrigue of a good movie - unexpected encounters with friends thought long-lost (read: quadratics), seemingly insurmountable obstacles (read: writing the first sixteen terms of a three-times recursive sequence), epic conflicts with bitter enemies (read: re-deriving the formula for a finite geometric sequence). It got me thinking: What problems in data science have these qualities? Where in data analysis can we find those “low floor, high ceiling” experiences for students?

Low Floor, High Ceiling

To those unfamiliar, “low floor, high ceiling” is a term often used in mathematics education to describe problems that are easy to get started with and provide multiple avenues for rich problem-solving. One example I’ve used in my Algebra classes concerns summing consecutive natural numbers: 1 + 2 + 3 + 4 + …

The “low floor” comes from the initial simplicity of the problem itself. Every student can add the first few natural numbers and start to look for patterns. Then, we might ask:

What is the sum of the first 10 natural numbers?
What about the first 100?
What about the first n?
What about the first n evens? Odds? Multiples of 3?
What about the sum of the consecutive natural numbers from j to k?
What do the sums of natural numbers look like? If the numbers begin at 1 and proceed consecutively, they are called the “triangular numbers.” What are the characteristics of the triangular numbers?

…and so on. The ceiling gets pretty high!

In Data Science, these types of experiences can be harder to find. In the past, I’ve assigned my students “Data Scavenger Hunts” that proceed from “Mild” to “Medium” to “Spicy.” In one such assignment using a dataset of Electric Vehicles, I included:

MILD: Find the 10 fastest cars in the dataset. Which brand name appears the most often?
MEDIUM: Find the auto brand that manufacters the fastest cars on average.
SPICY: For cars with more than 5 seat capacity and released in 2023, find the brand that manufactuers the fastest such cars on average.

The skills involved progress from simple (arrange and count) to complex (filter twice, group, compute averages across groups).

This isn’t quite the same, of course. My course with high schoolers doesn’t allow for problems that uncover the mysteries of clustering algorithms or gradient descent or neural networks. But, as I continue to develop the curriculum, one of my goals is to start introducing more of a problem-solving atmosphere. I think about problems like the 0-1 Knapsack or Traveling Salesperson. These are easy to understand and very difficult to solve. It’s well within the capabilities of a person with high-school math skills to come up with elegant and implementable solutions to either of these problems, and both problems allow for expansion to higher order thinking.

For the 0-1 Knapsack problem, we might introduce the problem statement as:

You have to transport items to a consignment store in your backpack. You can only make one trip, so you want to maximize your potential earnings. Here is a list of items that you can bring, along with their value and the amount of volume they take up in your bag. If your bag can only hold a total volume of _____, what items will you bring, and what will you leave behind?

Think about the ways that this problem is already challenging. But, the floor is low - you can simply try out solutions. You can restrict the volume of the bag so that it can only hold a few items. You can also encourage really solid data science approaches: Removing lowest value items first, then removing the largest high value items, then adding extra items that happen to fit if there is still room when the bag is filled with low-volume, high-value items. This is an exercise in algorithmic thinking and really creative problem-solving. And the ceiling gets higher when you allow students to implement their solutions in Python or in R. (For my own solution of the Knapsack Problem with a Genetic Algorithm, see my project here!)

There are similar approaches we might take to a problem like the infamous Traveling Salesperson problem:

A salesperson needs to visit 10 cities, then return back home. What is the shortest path they can take?

This is such a good problem that I’m surprised that more high schoolers aren’t given it in other math classes. There’s so much going on, but the root of it is so simple that anyone can start (and in math, starting is often half the battle). A teacher could engage students as young as 10 with this problem, and it could continue to challenge those in AP Calculus. In Data Science, though, this is a great problem for introducing algorithmic thinking and pushing students to think about optimization, which can do a lot in building the concept of greedy algorithms and more. For students ready to take this far, they can start thinking about using randomness to supplement a greedy algorithm and even approach the concept of a genetic algorithm (for my implementation, see here).

All of this to say, every kid deserves problems that sound easy but are actually really cool. I say cool and not hard because great (sometimes easy) problems need not send the cognitive mechanisms into overdrive, but rather spark conversation and motivation for taking the work further.