Diving Into the Data

We continue our quest to demystify how the Apple Watch estimates VO2 Max. Let’s take the plunge into the data and prepare it for analysis. If you’re tuning in for the first time, I’d recommend checking out the previous post to get up to speed. It’s worth the detour.

Apple Health Export Data

Thanks to the script we discussed last time, we converted the daunting export.xml file from HealthKit into a much friendlier apple_health_export.csv. Here’s a link to the Python script: Apple Health export.xml to CSV Converter. Note: if you’ve been playing along at home, your CSV may have a date suffix.

Now, let’s talk about the CSV itself. It’s fairly large: mine was about 1.3GB (which isn’t crazy for nearly a decade of data). Within this file, you’ll find rows and rows of HealthKit entries. There are a bunch of columns, ranging from the type of data to the source, value, unit, and timestamps of creation, start, and end. (There are many other columns, but we will ignore them because they hold more sparsely populated metadata.)
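If you want to poke at the file without loading all 1.3GB at once, here is a minimal sketch that streams the CSV and keeps only the core columns. The column names come from the preview table in this post; the function name iter_health_rows is made up for illustration and is not part of the converter script.

```python
import csv

# Core columns, as named in the preview table in this post.
KEEP = ["type", "sourceName", "value", "unit",
        "startDate", "endDate", "creationDate"]

def iter_health_rows(csv_file):
    """Yield one dict per CSV row, keeping only the core columns.

    csv_file is an open text file (or any iterable of lines).
    iter_health_rows is a hypothetical helper, not the author's code.
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        # Missing columns become empty strings instead of raising.
        yield {name: row.get(name, "") for name in KEEP}
```

To use it, open the file with open("apple_health_export.csv", newline="") and loop over iter_health_rows(f). Because rows are yielded one at a time, memory use stays flat no matter how big the export is.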

Only some of that data pertains to VO2 Max. Stupid ChatGPT joke:

Much of it is like that gym equipment you buy with great intentions – it’s there, but you’re not going to use it.

Here’s a sneak peek at what we’re dealing with:

| type | sourceName | value | unit | startDate | endDate | creationDate |
| --- | --- | --- | --- | --- | --- | --- |
| VO2Max | Erkin’s Apple Watch | 45.0789 | mL/min·kg | 2020-01-08 19:59:01-04:00 | 2020-01-08 19:59:01-04:00 | 2020-01-08 19:59:02 -0400 |
| DistanceWalkingRunning | Erkin’s Apple Watch | 0.289404 | mi | 2020-01-08 19:42:40-04:00 | 2020-01-08 19:47:45-04:00 | 2020-04-09 07:19:11 -0400 |
| DistanceWalkingRunning | Erkin’s iPhone 6s | 0.616122 | mi | 2020-01-08 19:46:19-04:00 | 2020-01-08 19:56:19-04:00 | 2020-01-08 19:57:22 -0400 |
| DistanceWalkingRunning | Erkin’s Apple Watch | 0.306078 | mi | 2020-01-08 19:47:45-04:00 | 2020-01-08 19:52:49-04:00 | 2020-04-09 07:19:11 -0400 |
| DistanceWalkingRunning | Erkin’s Apple Watch | 0.319039 | mi | 2020-01-08 19:52:49-04:00 | 2020-01-08 19:57:53-04:00 | 2020-04-09 07:19:12 -0400 |
| DistanceWalkingRunning | Erkin’s Apple Watch | 0.0363016 | mi | 2020-01-08 19:57:53-04:00 | 2020-01-08 19:58:55-04:00 | 2020-04-09 07:19:12 -0400 |
| ActiveEnergyBurned | Erkin’s Apple Watch | 39.915 | Cal | 2020-01-08 19:42:33-04:00 | 2020-01-08 19:47:37-04:00 | 2020-04-09 07:19:13 -0400 |

So, we need a way to extract only the data related to workouts. HealthKit is robust, and I’m sure that if I were doing this directly as part of an iOS application, I could use some of Apple’s APIs (like this). However, we’re not in Apple’s beautiful walled garden anymore, so we need a different way to extract the workout-related data. I was stymied at first because the exported HealthKit data don’t have any flag or metadata that indicates workout status. I know that specific sensors (like the heart rate monitor) sample at an increased frequency when a workout is started; however, I didn’t feel confident with an approach that tried to determine workout status implicitly. Then, I realized that the HealthKit zip contains a directory called workout-routes.

Using Workout-Routes

The workout-routes directory contains a bunch of .gpx files. I’ve never seen this type of file before. They’re also known as GPS Exchange Format files and store geographic information such as waypoints, tracks, and routes. So, they’re an ideal file format to store recordings of your position throughout a walk or run. If you’re curious about these files, take a gander at these links:

In short, this directory contains a record of every run and walk that I’ve been on! And in addition to exercises having GPS coordinates, they have timestamps!

These files are a flavor of XML and contain a ton of trackpoints with timestamps. I asked ChatGPT to whip up some code for extracting the first and last timestamps from the files (Prompt: “could you help me parse a gpx file? I would like to get the first and last time stamp from all the trkpts in trkseg”). With that little script, we can filter out the extraneous data.
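For a sense of what that parsing looks like, here is a minimal sketch (not the script ChatGPT produced). It assumes the files use the standard GPX 1.1 namespace and that each trkpt holds a time element in ISO 8601 form; workout_bounds is a hypothetical name. It takes the earliest and latest times rather than literally the first and last, which gives the same answer when trackpoints are in order.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumption: the files declare the standard GPX 1.1 namespace.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

def workout_bounds(gpx_source):
    """Return (earliest, latest) trackpoint time, or None if there are none.

    gpx_source is a file path or an open binary file object.
    """
    root = ET.parse(gpx_source).getroot()
    times = [
        # Older Pythons reject a trailing "Z", so swap in an explicit offset.
        datetime.fromisoformat(el.text.strip().replace("Z", "+00:00"))
        for el in root.iterfind(".//gpx:trkseg/gpx:trkpt/gpx:time", GPX_NS)
        if el.text
    ]
    if not times:
        return None
    return min(times), max(times)
```

Calling workout_bounds on each .gpx file in workout-routes gives one (start, end) pair per workout.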

Workout Health Data

I wrote a simple script to use the workout-routes to filter down the apple_health_export.csv. By matching the start and end timestamps of the GPX files with HealthKit data streams, I could isolate just the sensor measurements associated with each workout. To do this, I read through all the GPX files in the workout-routes directory and got the workout timestamps. Then, I opened the apple_health_export.csv and filtered out all rows that did not occur between the start and end timestamps of a workout.
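The filtering step can be sketched roughly like this. It is an illustration, not the linked script: it assumes a sample counts as part of a workout only when both its startDate and endDate fall inside one workout window, and that those columns look like the values in the preview table (for example 2020-01-08 19:42:40-04:00). The function names are made up.

```python
from datetime import datetime

def in_any_workout(row, workouts):
    """True if the sample's start and end both fall inside one workout window.

    row is a dict with "startDate" and "endDate" strings;
    workouts is a list of (start, end) timezone-aware datetimes.
    """
    start = datetime.fromisoformat(row["startDate"])
    end = datetime.fromisoformat(row["endDate"])
    return any(w_start <= start and end <= w_end for w_start, w_end in workouts)

def filter_workout_rows(rows, workouts):
    """Keep only the rows that occur during some workout."""
    return [row for row in rows if in_any_workout(row, workouts)]
```

This checks every workout for every row, which is simple but slow on a file this size; sorting the windows and using a binary search would be one way to speed it up.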

You can find the workout health data extraction script here. The Python script takes the workout-routes directory and the apple_health_export.csv file as inputs and writes workout_health_export.csv. Optionally, it takes a parameter for the file path of this new CSV.

With this code, we now have a dataset of all the HealthKit samples that directly pertain to a running or walking workout (the workout types for which Apple calculates VO2 Max).

Jumping the (Data Analysis) Gun

At this point, I got excited because I had data! So, I jumped directly to machine learning; I did some more initial workout data preprocessing and called scikit-learn to make some models. The results were… OK (MAE of ~1 for a value usually in the 30s).
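In case MAE is unfamiliar: it is the mean absolute error, the average size of the gap between predicted and actual values. Here is a tiny sketch with made-up numbers (not my results) to show the calculation; scikit-learn offers the same metric as sklearn.metrics.mean_absolute_error.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up VO2 Max values for illustration only.
actual = [35.0, 36.5, 34.0]
predicted = [36.0, 35.5, 35.0]
print(mean_absolute_error(actual, predicted))  # each guess is off by 1, so MAE is 1.0
```

So an MAE of about 1 means the model’s estimate is, on average, about one unit away from the value the watch reported.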

Several hours into model selection, I realized I had jumped the gun. I decided to call back the cavalry and do a thorough job of data exploration before training models. This data exploration process is what we will focus on in the next post.

I want to thank Emily A. Balczewski for reviewing this post and providing feedback on it and the project!