GitHub Copilot and the Methods of Rationality
I recently decided I wanted to give GitHub Copilot a shot. I’d seen mixed reviews of it on Hacker News: some people said it was invaluable, while others said it produced incredibly buggy output. I figured I’d try it myself and see what happens. After a few experiments, I discovered that Copilot sometimes gets things scarily right, while other times it completely misses the mark. I also discovered that I could often change my prompt slightly and Copilot would go from completely missing the mark to getting the answer spot on. My goal with this post is to run a few experiments in an attempt to discover the ideal way to use Copilot.
This post is inspired by Harry Potter and the Methods of Rationality. HPMOR is a Harry Potter fan fiction in which Harry uses science to understand magic. In chapter 22, Harry runs a number of experiments in an attempt to understand how the pronunciation of a spell changes the spell that gets cast. My goal is very much the same. I want to understand how different incantations affect the output of Copilot. In the end, I want to create a set of “guidelines” for writing Copilot prompts that help it produce the optimal code for your problem.
For this experiment, I decided to try a problem from the Advent of Code, and I want Copilot to generate all of the code for the problem. I’ve heard people have had success using Copilot solely as autocomplete, but I want to explore how Copilot handles natural language prompts. At the end of this post, I keep a list of the open questions I still have.
A Rough Beginning
Let’s start with the first problem from the Advent of Code 2021. The main problem is to count the number of elements in an array that are larger than the previous element.
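Before asking Copilot for anything, here is the solution I would write by hand, so we have something to check Copilot’s output against (this sketch is mine, not Copilot’s, and the function name is one I made up):
{% c-block language="python" %}
# Hand-written reference: count how many elements are strictly greater
# than the element immediately before them.
def count_increases(numbers):
    return sum(1 for i in range(1, len(numbers)) if numbers[i] > numbers[i - 1])
{% c-block-end %}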
The first thing we have to do is read an input file that contains one number per line like so:
{% c-block language="python" %}
199
200
208
...
{% c-block-end %}
So I open my text editor and type in:
{% c-block language="python" %}
# Read an array from input.txt.
{% c-block-end %}
And I get back:
{% c-block language="python" %}
# # Sort the array.
{% c-block-end %}
Hmm. That seems very strange. Let me try to be more specific:
{% c-block language="python" %}
# Read an array from a file input.txt.
# Sort the array.
{% c-block-end %}
Not much better. This is not inspiring much confidence in Copilot.
Let me try something different. Let me try helping Copilot along a little by providing a function signature:
{% c-block language="python" %}
# Read an array from a file input.txt.
def read_input():
    with open('input.txt', 'r') as f:
        return f.read().splitlines()
{% c-block-end %}
Pretty close! Copilot just needs to know that each line represents a number:
{% c-block language="python" %}
# Read an array from a file input.txt where each line contains a number.
def read_input():
{% c-block-end %}
And voila!
{% c-block language="python" %}
    with open('input.txt', 'r') as f:
        return [int(line) for line in f]
{% c-block-end %}
So we’ve observed two interesting things so far.
- It seems like we had to provide a function signature to Copilot to get it to write any code at all.
- We should be as specific as possible when giving our prompt to help Copilot understand exactly what it needs to do.
Now for the fun part. Let’s try changing our prompts slightly and try to understand why Copilot does what it does.
New Code in an Empty File
Let’s try to dig a bit more into the weird behavior we observed. When I gave Copilot the initial prompt of:
{% c-block language="python" %}
# Read an array from input.txt.
{% c-block-end %}
It gave back the completely bizarre response of:
{% c-block language="python" %}
# # Sort the array.
{% c-block-end %}
I have a hypothesis that this is happening because it’s common for files to start with large comments describing what the file does. Copilot may be overgeneralizing and may be attempting to generate the next few lines of the comment. To test this hypothesis, let’s try the following:
{% c-block language="python" %}
# This code solves the problem.
# Read an array from input.txt.
{% c-block-end %}
And Copilot generates pretty much exactly the code we need!
{% c-block language="python" %}
with open('input.txt', 'r') as f:
    array = f.read().splitlines()
    array = [int(x) for x in array]
{% c-block-end %}
So it seems like the initial bizarre behavior can be explained by Copilot attempting to generate a file comment.
How Specific is Specific?
The next experiment I want to try is to see how specific we need to be in order to get Copilot to do what we want. The prompt we wound up with was:
{% c-block language="python" %}
# Read an array from a file input.txt where each line contains a number.
def read_input():
{% c-block-end %}
Generating the code:
{% c-block language="python" %}
    with open('input.txt', 'r') as f:
        return [int(line) for line in f]
{% c-block-end %}
Let’s see what happens if we change the prompt slightly. First, let’s try deleting parts of the prompt to see how that changes Copilot’s output. In this case I deleted the “an array from a file” part.
{% c-block language="python" %}
# Read input.txt where each line contains a number.
def read_input():
    with open('input.txt', 'r') as f:
        return [int(line) for line in f]
{% c-block-end %}
So it seems like Copilot knows it’s reading an array. Now what happens if we just change the wording:
{% c-block language="python" %}
# Read input.txt with numbers on each line.
def read_input():
    with open('input.txt', 'r') as f:
        return f.readlines()
{% c-block-end %}
Huh??? For some reason Copilot understands that “where each line contains a number” means it should parse each line as a number, but it doesn’t understand “with numbers on each line” to mean the same thing. I found that if I add the “an array” part back, Copilot generates the correct code again.
I have a few hypotheses for why this could be happening:
- The output Copilot generates is highly non-deterministic. Did we just stumble upon a case Copilot handles correctly?
- For some reason Copilot is able to understand what it’s supposed to do when given “each line contains a number”, but is not able to understand what to do when given “with numbers on each line”.
As a quick test, I tried changing the word “read” to “parse”. My hypothesis was that using a more semantically meaningful word such as “parse” would help Copilot generate the correct code:
{% c-block language="python" %}
# Parse input.txt with numbers on each line.
def read_input():
    with open('input.txt', 'r') as f:
        return [int(line) for line in f]
{% c-block-end %}
And it does! Alternatively, if we no longer describe the structure of the input in the comment:
{% c-block language="python" %}
# Parse input.txt.
def read_input():
    with open('input.txt', 'r') as f:
        return f.read().strip()
{% c-block-end %}
Copilot no longer extracts the numbers from the line.
More testing is needed, but it seems like to get Copilot to read the input properly, we need to do two things:
- Let it know what it needs to do (“parse”, “read an array”)
- Describe the structure of the input (“with numbers on each line”, “where each line contains a number”)
Based on the result of the experiment where I changed “read” to “parse”, it seems that Copilot is influenced by the specific jargon used. I suspect that there are other “magic words” such as parse that can be used to guide Copilot. These are words that are more semantically meaningful.
If we can figure out the right magic words, we can make it a lot easier to program with Copilot. Based on this example, whenever we are doing anything with parsing, we will want to use the word “parse”.
An Unintuitive Result
Now that we have the array, we can solve the actual problem (counting the number of elements in the array that are larger than the previous element). I decided to first attempt this in two steps (with a quick hand-check of the plan after the list):
- Calculate the difference between each element and the previous element.
- Count the number of positive elements.
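Before handing this to Copilot, here’s that hand-check of the two steps on the first few sample numbers (my own code, just for illustration):
{% c-block language="python" %}
# Using the sample numbers 199, 200, 208 from the input shown earlier.
numbers = [199, 200, 208]
diffs = [b - a for a, b in zip(numbers, numbers[1:])]  # step 1 -> [1, 8]
answer = sum(1 for d in diffs if d > 0)                # step 2 -> 2
{% c-block-end %}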
Let’s try this:
{% c-block language="python" %}
# Compute the difference between each element in an array.
{% c-block-end %}
Copilot provided the following autocompletion:
{% c-block language="python" %}
def compute_difference(array):
    return max(array) - min(array)
{% c-block-end %}
Ok. That’s not what I meant. I suspect I wasn’t specific enough. Let’s try being more specific in the prompt:
{% c-block language="python" %}
# Compute the difference between each element and the previous element in an array.
def compute_differences(array):
    differences = []
    for i in range(len(array)):
        if i == 0:
            differences.append(array[i])
        else:
            differences.append(array[i] - array[i-1])
    return differences
{% c-block-end %}
Ok. This code is almost right. The only problem is that the first element of the array, array[0], is automatically appended to the output. I’m curious how we can get Copilot to properly handle edge cases like this. Intuitively, the first element shouldn’t be included in the differences array. I wonder if there are any magic words I can use to say “do the intuitive thing”.
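To make the off-by-one concrete, here’s what that version returns on the first few sample numbers (my own check, not Copilot output):
{% c-block language="python" %}
compute_differences([199, 200, 208])
# Returns [199, 1, 8]; what I actually want is [1, 8].
# The leading 199 is just the first element, not a difference.
{% c-block-end %}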
Let me explicitly mention the edge case:
{% c-block language="python" %}
# Compute the difference between each element and the previous element in an array.
# Do not add the first element of the array to the result.
def compute_differences(array):
    return [array[i] - array[i - 1] for i in range(1, len(array))]
{% c-block-end %}
Wow! I’m surprised Copilot understood that. It’s not ideal that we have to explicitly mention edge cases like this.
Let me see if there’s a magic word I can use to get Copilot to do the right thing. I don’t expect this to work, but:
{% c-block language="python" %}
# Intuitively compute the differences of each element in an array.
def differences(array):
    return [array[i] - array[i-1] for i in range(1, len(array))]
{% c-block-end %}
WTF!!! It seems that somehow the word “intuitively” got Copilot to produce the right answer with a much vaguer prompt. More evidence is needed, but I’m really curious about the power of the word “intuitively”. I have a few hypotheses:
- “Intuitively” causes Copilot to produce really short code.
- We just got really lucky.
For the time being, let’s move on. Now that we have the differences array, we need to count the number of elements that are positive:
{% c-block language="python" %}
# Calculate the number of positive elements in an array.
def count_positive(array):
    return sum(1 for x in array if x > 0)
{% c-block-end %}
That seems pretty straightforward. Let’s see what happens when we use the word “intuitively”. Anticlimactically, “Intuitively count the number of positive elements in an array.” gives the same code.
Now let’s try to get Copilot to glue all the code together:
{% c-block language="python" %}
# Read the input, compute the differences, and count the number of positive elements
def main():
    array = read_input()
    differences_array = differences(array)
    return count_positive(differences_array)
{% c-block-end %}
So it seems Copilot is able to reference other code in the same file.
After running this code, we get the correct result.
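For completeness, here’s the standard entry-point glue for actually running the script (this bit is hand-written, not a Copilot completion):
{% c-block language="python" %}
# Run the solution and print the answer.
if __name__ == '__main__':
    print(main())
{% c-block-end %}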
All in One Go
The last thing I want to try is solving the problem with a single function:
{% c-block language="python" %}
# Calculate the number of elements greater than the previous one.
def calculate_answer(numbers):
    answer = 0
    for i in range(len(numbers)):
        if numbers[i] > numbers[i - 1]:
            answer += 1
    return answer
{% c-block-end %}
That’s almost correct, but there’s a very subtle bug. This code uses range(len(numbers)) instead of range(1, len(numbers)). On the first iteration, that results in the comparison numbers[0] > numbers[-1], which compares the first element against the last element of the array.
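To make the bug concrete, here’s a quick hand-written check (the list is made up for illustration; it’s not the puzzle input):
{% c-block language="python" %}
# On the first iteration, numbers[0] > numbers[-1] evaluates 5 > 3,
# so the buggy version counts an extra "increase" that isn't there.
calculate_answer([5, 1, 2, 3])
# Returns 3; the correct answer is 2 (1 -> 2 and 2 -> 3).
{% c-block-end %}
Let’s see if we can use the word “intuitively” to fix this: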
{% c-block language="python" %}
# Intuitively calculate the number of elements greater than the previous one.
def calculate_answer(numbers):
    answer = 0
    for i in range(len(numbers)):
        for j in range(i + 1, len(numbers)):
            if numbers[i] > numbers[j]:
                answer += 1
    return answer
{% c-block-end %}
Ok. So it seems like “intuitively” threw the problem way off. That’s some evidence that we just got lucky when we used the word “intuitively” earlier.
Let’s instead try telling Copilot to explicitly handle the edge cases.
{% c-block language="python" %}
# Calculate the number of elements greater than the previous one. Handle edge cases.
def calculate_answer(numbers):
    answer = 0
    for i in range(len(numbers)):
        if i == 0:
            answer += 1
        elif numbers[i] > numbers[i-1]:
            answer += 1
    return answer
{% c-block-end %}
Uh… That’s pretty close, but also a bit bizarre. The code increments answer when i == 0, which it shouldn’t. I guess technically Copilot did handle the edge case, just not in the way we wanted. It’s starting to feel like Copilot is a genie and I have to phrase my wish in exactly the right way.
I wonder if there’s any other way I can guide Copilot. It seems that if I use the word “count” instead of “calculate”, Copilot will generate valid code:
{% c-block language="python" %}
# Count the number of elements greater than the previous one. Handle edge cases.
def count_greater(numbers):
    count = 0
    for i in range(len(numbers)):
        if i == 0:
            continue
        if numbers[i] > numbers[i-1]:
            count += 1
    return count
{% c-block-end %}
A Lot of Luck
I tried a few different variations of the prompt in an attempt to see which prompts work and which don’t. One thing I discovered is that the period actually matters! For one specific prompt I tested, Copilot generates incorrect code, but if you remove the period, it generates the correct code! That raises the question of how we can get Copilot to consistently do what we want. Here’s a list of a few of the prompts that I found work:
- Count the number of elements greater than the previous one. Handle edge cases.
- Count the number of elements greater than the previous element
- Count the number of elements greater than the previous element. Handle edge cases.
And here’s a list of a few prompts that generate incorrect code:
- Count the number of elements greater than the previous one
- Count the number of elements greater than the previous element.
- Calculate the number of elements greater than the previous one. Handle edge cases.
- Calculate the number of elements greater than the previous one
All of them generate code that doesn’t properly handle the edge case, with the exception of the last example, which instead generates:
{% c-block language="python" %}
def calculate_frequency(input_list):
    frequency = 0
    for i in input_list:
        frequency += i
    return frequency
{% c-block-end %}
After seeing the last example, I noticed something! When given a prompt, Copilot first generates the name of the function and only later generates the body. In the last example, the generated name calculate_frequency is really far off, which may have caused Copilot to generate a body that’s equally far off. What if I help Copilot along by supplying the function name myself:
{% c-block language="python" %}
# Calculate the number of elements greater than the previous one
def calculate_greater_than_previous(input_list):
    count = 0
    for i in range(len(input_list)):
        if input_list[i] > input_list[i-1]:
            count += 1
    return count
{% c-block-end %}
So that seems closer than the way-off example from before, but it’s still not quite correct. What if I add “handle edge cases”?
{% c-block language="python" %}
# Calculate the number of elements greater than the previous one. Handle edge cases.
def calculate_greater_than_previous(input_list):
{% c-block-end %}
That gives the correct answer!
I also decided to see what happens if I don’t include a function comment:
{% c-block language="python" %}
def calculate_greater_than_previous(array):
    return sum(1 for i in range(len(array)) if array[i] > array[i-1])
{% c-block-end %}
So that gives the code with the edge case bug. Oddly enough:
{% c-block language="python" %}
# Handle edge cases.
def calculate_greater_than_previous(array):
{% c-block-end %}
Gives a very bizarre and very wrong bit of code:
{% c-block language="python" %}
    for i in range(len(array)):
        if i == 0:
            array[i] = array[i]
        else:
            array[i] = array[i] + array[i-1]
    return array
{% c-block-end %}
I decided to go back to the incorrect prompts I had up above:
- Count the number of elements greater than the previous one
- Count the number of elements greater than the previous element.
- Calculate the number of elements greater than the previous one. Handle edge cases.
- Calculate the number of elements greater than the previous one
I found that for all of them, if I add “Handle edge cases” to the end and provide the function name count_greater_than_previous, the correct code is output! So it seems like for this problem, we have to do three things to get Copilot to consistently produce correct output (the scaffold right after this list shows what that looks like):
- Provide a comment describing what we want Copilot to do.
- Specify “handle edge cases”.
- Make sure the function name accurately reflects what we want Copilot to do.
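Putting that recipe together, the scaffold I type before letting Copilot complete the body looks like this (the exact completion you get may vary):
{% c-block language="python" %}
# Count the number of elements greater than the previous element. Handle edge cases.
def count_greater_than_previous(numbers):
{% c-block-end %}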
This makes me curious whether, when Copilot generates a poor function name, that means the prompt was too vague. Maybe that’s a way we can gauge how good our prompts are.
Attempting to Handle More Cases
Unfortunately, “handle edge cases” doesn’t always work. I went back to the compute-differences prompt and added “Handle edge cases”, but it didn’t help:
{% c-block language="python" %}
# Compute the difference between each element and the previous element in an array. Handle edge cases.
def compute_differences(array):
    differences = []
    for i in range(len(array)):
        if i == 0:
            differences.append(array[i])
        else:
            differences.append(array[i] - array[i-1])
    return differences
{% c-block-end %}
So it seems like “handle edge cases” isn’t a panacea. In fact, I wonder if it just causes Copilot to add a case for i == 0 and handle the edge case in some way.
It seems like using the word “diff” instead of “compute” makes this example work:
{% c-block language="python" %}
# Diff each element and the previous one.
def diff(numbers):
    return [numbers[i] - numbers[i-1] for i in range(1, len(numbers))]
{% c-block-end %}
Yep! I wonder if, in the same way that certain phrases are magic phrases, certain words are “curse words”, such as “calculate”. These words are vague and make it hard for Copilot to understand what to do. Looking at the earlier data, it does seem like the word “calculate” results in fewer of the examples being correct.
Final Takeaways
After going through these experiments, there are a few takeaways I have. More evidence for these is needed, but I think these are a good start:
- Copilot is highly sensitive to your prompt. Slight changes in the prompt will generate completely different code.
- When first writing code in a file, include a separate file header.
- Make sure your prompt describes the structure of the input.
- Try to use “magic words” - certain words and phrases that provide lots of context. Based on limited evidence, two I’ve found to work well so far are “parse” and “avoid edge cases”. (After a bit more experimentation, I’ve found that “avoid edge cases” works better than “handle edge cases”.)
- Avoid “curse words” - certain words that are vague. Examples that I’ve found are “calculate” and “compute”.
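To tie the takeaways together, here’s the full solution assembled from the snippets above, using the prompts that worked for me (I renamed a couple of functions so the pieces fit together, the exact completions you get will likely differ, and the entry-point glue at the bottom is hand-written):
{% c-block language="python" %}
# This code solves the problem.

# Parse input.txt with numbers on each line.
def read_input():
    with open('input.txt', 'r') as f:
        return [int(line) for line in f]

# Compute the difference between each element and the previous element in an array.
# Do not add the first element of the array to the result.
def compute_differences(array):
    return [array[i] - array[i - 1] for i in range(1, len(array))]

# Count the number of positive elements in an array.
def count_positive(array):
    return sum(1 for x in array if x > 0)

# Read the input, compute the differences, and count the number of positive elements.
def main():
    array = read_input()
    differences_array = compute_differences(array)
    return count_positive(differences_array)

# Hand-written entry point; not a Copilot completion.
if __name__ == '__main__':
    print(main())
{% c-block-end %}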
I hope you enjoyed these experiments, and I would love to hear about any experiments of your own. You can reach me on Twitter @mmalisper.
By the way, if you are looking for a software engineering role, my company Freshpaint is hiring.
Open Questions
- Is there a way we can measure the accuracy of code generated by Copilot? If so, we could measure what words get Copilot closer to the result we want and which words get further away.
- How does the output differ across different programming languages for the same prompt?
- What happens if you add types?
- If Copilot generates a non-specific name, does that mean the prompt provided wasn’t specific enough?