A GitHub Copilot AI Anecdote

I’ve been watching Three Body—the 2023 Chinese television adaptation of the same Liu Cixin science fiction novel that Netflix also put out a version of. My copy was found on the high seas shortly after the release of the series, but it’s now available on a few streaming services. I was about 20 minutes into watching Episode 13 when the subtitles suddenly lost their sync with the video; text appeared on screen about 15-20 seconds after it was spoken. I tried Plex’s feature for downloading alternate subtitles, but each one exhibited the same problem. I spent $2 to just buy that episode from Amazon, skipped ahead to where I had left off, and even that had the same misaligned subtitles. This post is the story of how, having failed to buy my way out of the problem, I used AI to fix it.

I have a subscription to GitHub Copilot through my work, so after a rough assessment of the problem I opened the offending srt caption file in Visual Studio Code and submitted this prompt in the chat window:

The srt file I have open is incorrect; all captions are set 15 seconds too late beginning with caption number 225. Please correct the srt file, altering the time codes for all caption entries beginning with 225, to have them appear on screen 15 seconds earlier.

What I wanted was for the robot to just echo out a repaired srt file, but instead it offered a Python script that could fix the problem. Apologies for the spoiler, but anyone skimming this post for a solution to their own caption file problems might want to skip a bit farther down:

To adjust the time codes in an SRT file, you can read the file line by line, identify the lines that contain time codes, and adjust the time codes as needed. Here’s a Python script that does that:

import re
from datetime import timedelta

# Open the SRT file
with open('Three-Body.S01E13.CHINESE.WEBRip.x264-ION10[eztv.re].srt', 'r') as file:
    lines = file.readlines()

# Regular expression to match time codes
time_code_regex = re.compile(r'(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})')

# Time delta to subtract
time_delta = timedelta(seconds=15)

# Adjust the time codes
for i in range(len(lines)):
    # Check if the line is a time code line
    match = time_code_regex.match(lines[i])
    if match:
        # Check if the caption number is 225 or higher
        caption_number = int(lines[i-1].strip())
        if caption_number >= 225:
            # Parse the start and end times
            start_time = datetime.strptime(match.group(1), '%H:%M:%S,%f')
            end_time = datetime.strptime(match.group(2), '%H:%M:%S,%f')

            # Subtract the time delta
            start_time -= time_delta
            end_time -= time_delta

            # Format the new time code
            new_time_code = f'{start_time.strftime("%H:%M:%S,%f")[:-3]} --> {end_time.strftime("%H:%M:%S,%f")[:-3]}'

            # Replace the line with the new time code
            lines[i] = new_time_code

# Write the adjusted lines back to the file
with open('Three-Body.S01E13.CHINESE.WEBRip.x264-ION10[eztv.re].srt', 'w') as file:
    file.writelines(lines)

This script opens the SRT file, reads all the lines into a list, and then iterates over the list. For each line, it checks if the line is a time code line and if the caption number is 225 or higher. If both conditions are met, it parses the start and end times, subtracts 15 seconds from each, and replaces the line with the new time code. Finally, it writes the adjusted lines back to the file.

Please replace 'Three-Body.S01E13.CHINESE.WEBRip.x264-ION10[eztv.re].srt' with the path to your SRT file if it’s different.

The script looked decent enough at a glance, although it didn’t include any details about how to run the script. I knew to put it in a .py file, and although I have very little experience with Python I could guess that the script might need to be run with python3 rather than python on my Mac laptop (which turned out to be true, but perhaps due to some quirk of my own setup). The script also edits the file in-place, rather than saving the result out to a new file, which I would consider bad practice; arguably it should take the srt filename as an input parameter, but then again we’re just trying to solve a single, weird, one-off problem. So we’ll deduct a couple of “ease of use” and “best practice” points, but more importantly the script didn’t work!

First, it failed to import the “datetime” module—the syntax highlighting in VS Code made that obvious enough, but the IDE’s proposed fix of import datetime wasn’t correct, either. So I added from datetime import datetime to the top of the file and ran the script against my local copy of the srt. I didn’t bother to look very closely at the result—in my defense, it was 10:30pm and I was in “watching tv” mode, not “I am a software engineer” mode—and copied it to the Plex server. I restarted playing the episode where I left off and… now there weren’t any subtitles at all!

Let’s look at a snippet of the edited srt file to see if we can spot the problem:

223
00:16:25,660 --> 00:16:30,940
Correcting them is what I should do.

224
00:16:31,300 --> 00:16:45,660
At that time, I thought my life was over,
and I might even die in that room.

225
00:16:51,020 --> 00:16:54,500This is the person you want.
I've handled all the formalities.

226
00:16:54,500 --> 00:16:56,420You know the nature of this, right?

Subtitle formats are actually pretty easy to read in plaintext, which I appreciate. And a glance at the above snippet shows that GitHub Copilot’s script resulted in the timestamp of each entry running immediately into the first line of text of that caption. I’m trying to keep this relatively brief, so I’ll just note that a cursory search turned up a well-known quirk of Python’s file.readlines method (which reads a file while splitting it into individual lines of text): it keeps the “newline” character at the end of each line. The corresponding file.writelines method (which writes a list of lines out to a file) doesn’t add any line endings of its own, so it relies on each line already ending with that “now go to the next line” character. The replacement timestamp lines built by the script didn’t have one. As someone who doesn’t often use Python, that’s an unexpected behavior, so to me this feels like a relatable and very human mistake to make. But anyone used to doing text file operations in Python might find it a strangely elementary thing to miss.
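The newline behavior described above can be reproduced in a few lines. This is only a sketch: it uses in-memory io.StringIO objects instead of a real srt file, and the "before" timestamps are an assumption, reconstructed by adding 15 seconds back onto the broken output shown earlier.

```python
import io

# One caption entry as it might have looked before the shift
# (timestamps are assumed: the broken output plus 15 seconds).
original = "225\n00:17:06,020 --> 00:17:09,500\nThis is the person you want.\n"

# readlines() keeps the trailing "\n" on every line it returns.
lines = io.StringIO(original).readlines()

new_time_code = "00:16:51,020 --> 00:16:54,500"

broken = list(lines)
broken[1] = new_time_code           # newline lost, as in the first script

fixed = list(lines)
fixed[1] = new_time_code + "\n"     # newline restored, as in the working script


def write_out(text_lines):
    """Write the lines the way the script does and return the resulting text."""
    out = io.StringIO()
    out.writelines(text_lines)      # writelines() adds no separators itself
    return out.getvalue()


broken_text = write_out(broken)
fixed_text = write_out(fixed)
print(broken_text)
print(fixed_text)
```

The first printout should show the timestamp fused with the caption text, and the second should show them on separate lines.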

After adding the datetime import, fixing the missing linebreak, changing the script to save to a separate file with an “_edited” suffix, and adjusting the amount of time to shift the captions after some trial and error (not Copilot’s fault), we end up with this as the functioning script:

import re
from datetime import timedelta
from datetime import datetime

# Open the SRT file
with open('Three-Body.S01E13.CHINESE.WEBRip.x264-ION10[eztv.re].srt', 'r') as file:
    lines = file.readlines()

# Regular expression to match time codes
time_code_regex = re.compile(r'(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})')

# Time delta to subtract
time_delta = timedelta(seconds=18)

# Adjust the time codes
for i in range(len(lines)):
    # Check if the line is a time code line
    match = time_code_regex.match(lines[i])
    if match:
        # Check if the caption number is 225 or higher
        caption_number = int(lines[i-1].strip())
        if caption_number >= 225:
            # Parse the start and end times
            start_time = datetime.strptime(match.group(1), '%H:%M:%S,%f')
            end_time = datetime.strptime(match.group(2), '%H:%M:%S,%f')

            # Subtract the time delta
            start_time -= time_delta
            end_time -= time_delta

            # Format the new time code
            new_time_code = f'{start_time.strftime("%H:%M:%S,%f")[:-3]} --> {end_time.strftime("%H:%M:%S,%f")[:-3]}\n'

            # Replace the line with the new time code
            lines[i] = new_time_code

# Write the adjusted lines back to the file
with open('Three-Body.S01E13.CHINESE.WEBRip.x264-ION10[eztv.re]_edited.srt', 'w') as file:
    file.writelines(lines)

I saved that as fix_3body.py in the same directory as the srt file, and ran it from a terminal (also in that directory) with python3 fix_3body.py. And that did work—my spouse and I got to finish watching the episode. Hooray! I’m not going to share the edited srt file, but if you’re stuck in the same situation with episode 13, this should get you most of the way to your own copy of a correct-ish subtitle file (I don’t think the offset is exactly 18 seconds, but close enough).
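For anyone adapting this, the earlier complaints about the hard-coded filename and in-place editing could be addressed with a variant along these lines. It is a sketch rather than the script that was actually run: the function names (shift_stamp, shift_captions, shift_file) are made up for illustration, and reading the file as UTF-8 with an optional byte-order mark is an assumption about how the srt file is encoded.

```python
import re
from datetime import datetime, timedelta

# Same time code pattern and time format as the script above.
TIME_CODE = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")
SRT_TIME = "%H:%M:%S,%f"


def shift_stamp(stamp, delta):
    """Move a single HH:MM:SS,mmm stamp earlier by delta."""
    shifted = datetime.strptime(stamp, SRT_TIME) - delta
    return shifted.strftime(SRT_TIME)[:-3]  # trim microseconds to milliseconds


def shift_captions(lines, first_caption, seconds):
    """Return a copy of lines with captions >= first_caption moved earlier."""
    delta = timedelta(seconds=seconds)
    result = list(lines)
    for i in range(1, len(result)):
        match = TIME_CODE.match(result[i])
        number = result[i - 1].strip()  # caption number sits on the line above
        if match and number.isdecimal() and int(number) >= first_caption:
            start, end = (shift_stamp(s, delta) for s in match.groups())
            result[i] = f"{start} --> {end}\n"  # keep the newline this time
    return result


def shift_file(src_path, dst_path, first_caption, seconds):
    """Read src_path, shift its captions, and write the result to dst_path."""
    # Assumption: the file is UTF-8, possibly with a byte-order mark.
    with open(src_path, "r", encoding="utf-8-sig") as src:
        lines = src.readlines()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(shift_captions(lines, first_caption, seconds))
```

With placeholder filenames, a call such as shift_file("episode13.srt", "episode13_edited.srt", 225, 18) would leave the original file untouched and write the shifted captions to the second path.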

I’ll close with a few scattered thoughts:

  • I wonder what caused the discrepancy? My best guess is that the show was edited, the subtitles were created, and then approximately 18 seconds of dialog-free footage was removed from the shots of Ye Wenjie being moved from her cell and the establishing shot of the helicopter flying over the snowy forest.
  • Overall this felt like a success. As I said, I was not really geared up for programming at the time, and while I could have written my own script to fix this subtitle file, it would have taken a while to get started: what language should I use? How exactly does the SRT format work? How can I subtract 18 seconds from a timecode with fewer than “18” in the seconds position (like 00:16:07,020)? Should I have it accept the file as an input parameter? Maybe have it accept which caption to start with as an input parameter? Should that use the ordinal caption number or a timestamp as the position to start from? Even if I wouldn’t have made the same choices as Copilot for those questions, it got me to something nearly functional without my having to fully wake up my brain.
  • Of course, I did have to make at least two changes to make this script functional. Python is very widely used, especially in ML/NLP/AI (lots of text-mangling!) circles, and GitHub Copilot seems to be considered the “smartest” generally-available LLM fine-tuned for a specific purpose like this. Savita Subramanian, Bank of America’s head of US equity strategy, recently asserted on Bloomberg’s Odd Lots podcast that, “we’ve seen … the need for Python Programming completely evaporate because you can get … AI to write your code for you”. And not to pick on her in particular, but in my experience with AI thus far that’s false, especially the “completely evaporate” part. I’m a computer programmer, so you can discount that conclusion as me arguing in my own interest if you’d like.
  • I’m not sure there’s a broader lesson to be taken from this, but it struck me that a broken subtitle file feels like the kind of thing that one should expect when sailing the high seas for content. I was surprised when I saw that even Amazon’s copy of the episode had broken subtitles. In the end, controlling a local copy of the episode and its subtitles allowed me to fix the problem for myself. And in fact this kind of strange problem has been surprisingly common on streaming services. My family ran into missing songs on the We Bare Bears cartoon, and Disney Plus launched with “joke-destroying” aspect ratio changes to The Simpsons, just off the top of my head. I don’t believe I can submit my repaired subtitle file to Amazon, but it seems like there’s some kind of process in place to submit corrections to a resource like OpenSubtitles.org.
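On the question above about subtracting 18 seconds from a timecode like 00:16:07,020: Python's timedelta arithmetic handles the borrow from the minutes position on its own. The sketch below illustrates that with a hypothetical helper, format_srt_time, which is not part of the script above and assumes the shifted time does not go below zero.

```python
from datetime import timedelta


def format_srt_time(offset):
    """Render a non-negative timedelta as an SRT-style HH:MM:SS,mmm string."""
    total_ms = offset // timedelta(milliseconds=1)  # whole milliseconds as an int
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1_000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{ms:03}"


start = timedelta(minutes=16, seconds=7, milliseconds=20)  # 00:16:07,020
shifted = start - timedelta(seconds=18)
print(format_srt_time(shifted))  # 00:15:49,020
```

The script above gets the same result by going through datetime.strptime and strftime, so this is an alternative way to see the arithmetic rather than a correction.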
