Drafting My Fantasy Football Team with GPT-4

Figure 1: DALLE 2, "An american fantasy football wizard casting a spell on a computer, digital art, trending"

Introduction

"Gentlemen, it is better to have died a small boy than to fumble this football."

John Heisman

Fantasy football is a fun game to play with your friends.

Each year, your author finds enjoyment in drafting his fantasy football team using a series of needlessly complex algorithms. These rube-goldberg-esque attempts have been largely, if not entirely, futile in terms of improving outcomes. As discussed in last year's installment on this issue, there are much simpler, more time efficient, and all around better ways to draft your team. However, the point of fantasy football is not using your time well, its ~~beating your friends~~ having fun.

Every attempt so far has come in the form of what are now commonly known as "copilots": a helpful assistant that only provides recommendations. There was always a human in the loop ready to step in if things go haywire. Although these setups are currently in vogue, the goal this year was to take it a step further.

There would be no human present to represent the team on the night of the draft. The system would be entirely autonomous.

Building an Autonomous Draft Bot

Part of the tradition around this exercise is to try to use whatever is at the frontier of machine learning at the time. That is currently unquestionably GPT-4. What left is there to be said about GPT-4 and large language models? ¹

We can formulate the drafting problem as a function that takes in the current state of the team and a list of potential players to draft. It will then output a selection from the list. One of the benefits about using a language model is the ability to mix both language and quantitative statistical information. To get this information about the available players, we pulled from fantasypros.com ². For example, the textual representation for Patrick Mahomes was as follows. In this example, the "Expert Note" and "News" portion are truncated.

## Patrick Mahomes II (KC, QB)

### Expert Note
Arguably the best pure passer to ever play in the NFL, Patrick Mahomes enters
his age-28 season at the height of his powers. Mahomes led...

### News
Patrick Mahomes, other starters to start preseason game against Cardinals...

### 2022 Stats
|   | Pass Yds | Pass Tds | Rush Yds | Rush Tds | Points |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Last Week | 202 | 1 | 29 | 0 | 15 |
| Last 3 Weeks | 251.3 | 2 | 15 | 0.3 | 21.4 |
| Season to Date | 308.8 | 2.4 | 21.1 | 0.2 | 25.2 |

### Projections
| Pass Cmp | Pass Att | Pass Yds | Pass Tds | Pass Ints | Rush Att | Rush Yds | Rush Tds | Fumbles | Points |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 400 | 602 | 4757.1 | 35.5 | 11.1 | 63.7 | 333.3 | 2.9 | 1.9 | 368.4 |

### ADP
|   | Consensus | RTSports | Sleeper | CBS |
| :--- | :--- | :--- | :--- | :--- |
| Overall | #12 | #16 | #12 | - |
| Quarterbacks | #1 | #1 | #1 | #4 |

We created these representations for the three best players at every position group and concatenated them together.

After this, we added a table representing the current state of the team. This allowed the model to see what players they already had and what gaps were remaining. The following is a sample of what this table looked like:

## Starters
| QB1 | Patrick Mahomes | KC | 
| QB2 | Joe Burrow | CIN | 
| WR1 | Cooper Kupp | LAR | 
| WR2 | Tyler Lockett | SEA | 
| RB1 | Kenneth Walker | SEA | 
| RB2 | J.K. Dobbins | BAL | 
| TE | Dallas Goedert | PHI | 
| WR/TE | Jerry Jeudy | DEN | 
| RB/WR/TE | Brian Robinson | WAS | 

## Bench
| QB | Tua Tagovailoa | MIA |
| TE | Dalton Schultz | HOU |
| TE | Tyler Higbee | LAR |
| RB | Jerick McKinnon | KC |
| None | Empty | None |

To make sure the model behaved as intended, a system prompt was eventually added with more background information and desired behaviors. One commonly arising issue was a refusal to actually answer the question. The model would instead solely emphasize that it didn't know and recommend the user to think on it. This system prompt was able to fix that issue.

You are a Fantasy Football Draft bot. Users give you information about their team and players, and you tell them who to pick. Before you answer, you should always think through and consider the options before you make a decision. It is required that you give a specific recommendation of a single player to draft, even if you are not entirely sure. The league is a half-PPR two QB league. Make the last line just the recommended player name.

In general putting the "pipes" together was a straightforward but arduous process. The whole thing would not have been possible without sleeper's fantasy API. This thanklessly maintained API allows easily pulling live data during the draft. It does not allow picks to be made over the API, however, so that part would have to be manual. A walk through what happened for each pick went as follows:

The league commissioner would initiate the bot with a GET request to a web server, specifying the draft id.
The web server would pull the information about the draft, finding: the best player available players and the current players on the team.
The web server would pull the information about these players from fantasypros.com.
This information would be used to create the prompt.
This prompt would then be sent to GPT-4 using the openai Chat Completion API.
The response from this API would be formatted, and then sent back to the commissioner as text.
The commissioner would record the pick manually. There was no intervening or overruling picks allowed.

Evaluating the Bot

"It is better to be dumb on your own instead of letting other people be dumb for you."

A review of the system from one of its competitors.

The most interesting part of this project came after all of the hard techinical stuff was done. How do you evaluate something like this? Is GPT-4 the best model for the job? What about Claude? Or using a longer context? Or any number of design decisions?

Quantitative evaluations are infeasible. Using historical data is a no go, since the model already "knows" what happened in past NFL seasons. You can't run it on this year's data either because we won't know what the best decisions are until the season is over. If we could dynamically determine what the best pick at any point would be already, there would be no point in having a model in the first place.

The only option is qualitative evaluations. This mostly manifested as running it on mock drafts, seeing what picks it made and trying to evaluate what I thought was "right" in those scenarios. In other words, giving it the "eye test."

I quickly gave up on trying to use this process to inform any aspect of the system other than the obvious bug fixes. It was very slow and provided little signal. Evaluating these kinds of systems is increasingly becoming more and more like evaluating humans. It feels strange to take something that is so fundamentally quantitative and evaluate it based on your gut feelings.

It is useful to think about removing the AI part from the equation. If you wanted to pay a person rather than an API to draft for you, how would you go about evaluating them? Well, the obvious thing would be to look a their credentials but that might only get you so far. What you would likely end up doing is something reminiscent of the a standard interview process. Ask them questions to see what they think about different players or give them hypothetical scenarios and see how they pick.

In fact, I would argue that there is something nearing a fundamental law: The easier it is to quantitatively evaluate a task, the less economically useful that task is. Easily being able to compare between alternatives quickly turns a product into a commodity. Should you use GPT-4 or Claude for your sentiment analysis task? Just measure their performance, compare the differences in cost, and you're done. If a new model comes along you can just continually repeat this process until you get the cheapest, highest performing system.

This dynamic in inescapable. Think about how difficult we find evaluating the current state of the art in intelligence: people. Ask any hiring manager and they will tell you the difficulties of trying to find and evaluate talent. Hiring software engineers is a lot more than seeing how fast they can type and how many HumanEval questions they can do. It is a complex human-driven (and often flawed) process that tries to not only evaluate facts, but more nebulous concepts like "culture fit." When it comes down to it, the valuable things are the things you can't easily measure. Good luck putting a number on how well GPT-4 handles internal bickering between two teams with overlapping responsibilities.

There will come a time when you will have to try to decide between using a human software engineer or an AI software engineer. However, the process will be a lot like the one we have today. Maybe you prefer organic intelligence. Maybe you don't like having to speak with someone over the phone. But in the end, you just want to get your problem solved at the minimal cost.

Results

For all this talk, the model did a good job of assuaging any fears that this future might be imminent. Here are the picks it ended up making, in order. Before listing these, it is worth mentioning for those of you who care about these details: half ppr, super flex (2 QBs).

Nick Chubb (CLE, RB)
Trevor Lawrence (JAX, QB)
Amari Cooper (CLE, WR)
Najee Harris (PIT, RB)
George Kittle (SF, TE)
Russell Wilson (DEN, QB)
Diontae Johnson (PIT, WR)
Cam Akers (LAR, RB)
AJ Dillion (GB, RB)
Kadarius Toney (KC, WR)
Rashod Bateman (BAL, WR)
Mac Jones (NE, QB)
Rondale moote (ARI, WR)
Baker Mayfield (TB, QB)
DJ Chark (CAR, WR)
Kendre Miller (NO, RB)

There are some definite flaws with these picks. Picking four quarterbacks is not recommended, even for a 2 QB league. One benefit of this setup is that we can read the model's rationals for making it's pick. For example, here is what it said about selecting Nick Chubb overall:

Analyzing the available players, we can rule out Jonathan Taylor due to his injury and off-field situation - starting the season on PUP with a potential trade saga doesn't bode well for his fantasy steady production. Similarly, Travis Kelce has a significant injury concern with his knee, making him a risky choice.

Looking at the QBs, Dak Prescott underperformed last season and has some discouraging trends regarding his passing efficiency, therefore I would not recommend him as a top pick. Trevor Lawrence had an impressive rise in 2022, but Deshaun Watson might still hold the edge in overall talent and potential, though he's coming off a less effective season and has missed significant playing time in recent years, which can be a risk.

Stefon Diggs is consistently elite at wide receiver and is in his prime, and Tyreek Hill presents a tempting choice with his large target share and potential to be the overall WR1. However, they are outmatched by Cooper Kupp, who simply led all WRs in points per game last year. Even with concerns about Matthew Stafford's health, he remains a top choice at WR.

Comparing Nick Chubb and Bijan Robinson at RB, Chubb is the more proven player. Though Robinson is an exciting prospect, Chubb's consistency and the fact that Kareem Hunt and D'Ernest Johnson are no longer on the roster, lead me to prefer him.

Taking all this into account, my recommendation for your first pick would be Nick Chubb. He's a steady, reliable performer at RB with a big opportunity in the Browns' offense this season. He's almost certainly going to maintain much of that workload and could see an increased role in the passing game as well, increasing his overall fantasy value.

Nick Chubb

This sounds nice, but it is hard to say (for many of the reasons listed in the previous section) if this was actually overall better or worse than the basic autodraft functionality built into the website. Trying to analyze the picks in detail requires a greater fantasy football knowledge than your author can provide. But, what I can say is that the team it drafted not only has the least amount of points through three games, it also is the only team not to break 300 points overall. With this in mind, we can safely call this iteration of the bot another failure.

But that's the thing about bots. They only get better with time.

See you next year.

Footnotes:

Well, you could say some of this.

This is, to the best of my knowledge, not against the terms of service. But your author is famously not a lawyer.