Growing up, I was a huge fan of the show Scrubs. It felt very goofy and safe, but at the same time would drop these sudden emotional moments that really hit hard. I've looked around, and other fans seem to agree (see here, here, and here, for example). In fact, the show was even rebooted recently, which inspired me to try out this project.
I had two goals for this project. First, I wanted to see if I could get a machine learning model to accurately predict how funny and sad a given scene in Scrubs is. Second, I wanted to see if I could use those scene-level predictions to explain and predict how people rated an episode of Scrubs on IMDb. Specifically, I was interested in exploring my belief about the importance of those emotional gut-punches, and how they work together with the show's comedic scenes to keep us hooked.
This project repository represents the machine learning pipeline from start to finish. I scraped detailed episode transcripts from the Scrubs Fandom Wiki and episode ratings from IMDb (via the third-party ScrapingBee). I then had an LLM build me a quick labeling tool that I used to label 15 episodes. For each episode, I rated how funny and sad/emotional each scene was on a 1-5 scale while rewatching the episodes. From there, I exported the labels and merged them with the episode transcripts to fine-tune several variations of a DeBERTa-based classifier using Google Colab. Additionally, I tested Gemini 2.5 Pro, Flash, and Flash-Lite on the labeled scenes using few-shot prompting. I evaluated the performance of both approaches, and selected the highest-performing model from each to perform inference on all of the scraped transcripts. From there, I took those funny/sad scene predictions, aggregated them at the episode level, and used them to predict IMDb ratings for the episodes.
The steps below describe this pipeline in more detail.
To get transcripts for Scrubs, I initially downloaded (and even labeled) subtitle (.SRT) files from TVSubtitles.net, but was pretty unsatisfied with the lack of visual detail in the subtitles. After researching a bit more, I stumbled upon fan-made transcripts available on the Scrubs Fandom page. It's worth mentioning that, because they are fan-made, there is some stylistic variation from episode to episode. Additionally, the transcripts end partway through season 6 (out of 8; I choose not to acknowledge season 9). That said, even with these limitations, the Fandom transcripts include speaker labels and much richer descriptions than the subtitle files. I felt that this level of detail would be beneficial for the fine-tuning step of my project.
For example, here's a comparison for the beginning of Season 1, Episode 5:
- Subtitles:
How's he doing?
He has neutropenic fever.
His white blood cell count's stabilised.
He's not getting any worse.
How you feeling, Jared?
OK, I guess.
vs.
- Fandom Transcription:
Open: The Hospital -- The ICU -- A Patient's Room (daytime) Elliot stands at the end of a young boy's bed, reading his chart.
The boy, Jared, has his eyes locked on his TV. J.D. enters.
J.D.: [quietly, to Elliot] Hey, how's he doing?
Elliot: Well, he was admitted with neutropenic fever, but his white blood cell count's stabilized. Best I can say is he's not getting any worse.
They turn around to face the boy.
Elliot: [chipper] How ya feeling, Jared?
Jared: Okay, I guess.
Fandom has a generous API that made scraping the transcripts relatively straightforward. The scraper starts by finding all the season "subcategories" within the transcripts "category" for the show, and iterating through each season to scrape individual episode transcript pages. For each episode, it extracts the raw HTML from the corresponding page and removes irrelevant information such as navigation links, repeat lines, and ads, keeping only the actual lines of dialogue and descriptions. It then groups these lines into scenes of up to 150 words each. It generates a scene ID for each scene, and also assigns a position value between 0 and 1 measuring how far along in the episode the scene falls.
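The scene-grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the function name, the scene ID format, and the exact position formula (scene index divided by the last index) are my assumptions.

```python
def group_into_scenes(lines, episode_id, max_words=150):
    """Greedily pack cleaned transcript lines into scenes of at most max_words words.

    Note: a single line longer than max_words still becomes its own scene.
    """
    scenes, current, count = [], [], 0
    for line in lines:
        n_words = len(line.split())
        # Close the current scene if this line would push it past the cap.
        if current and count + n_words > max_words:
            scenes.append(current)
            current, count = [], 0
        current.append(line)
        count += n_words
    if current:
        scenes.append(current)

    total = len(scenes)
    return [
        {
            # Hypothetical ID format: episode ID plus a zero-padded scene index.
            "scene_id": f"{episode_id}_scene_{i:03d}",
            "text": "\n".join(scene_lines),
            # Assumed position formula: 0.0 for the first scene, 1.0 for the last.
            "position": i / (total - 1) if total > 1 else 0.0,
        }
        for i, scene_lines in enumerate(scenes)
    ]
```

Packing whole lines (rather than cutting at exactly 150 words) keeps each speaker's line intact, at the cost of scenes that are usually a little shorter than the cap.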
The decision to split scenes on word counts was mostly based on trial and error. I tried a couple other heuristics such as splitting scenes based on the main character's internal monologue, but that was inconsistent and sometimes scenes based on these heuristics would run on for too long. On the flip side, splitting an episode into too many scenes would also be problematic, since labeling more than ~50 scenes per episode would limit how many episodes I could cover.
As part of the project I wanted to obtain IMDb ratings for each episode so I could explore the relationship between my predicted funny/sad ratings and the IMDb ratings. Unfortunately, IMDb requires creating an AWS account to access its API, which felt out of scope for the task at hand. Instead, I found this ScrapingBee article and used their free trial. To determine which pages to scrape, I manually compiled a list of episode URLs. The IMDb scraping script iterates over the episodes in this list, and extracts and saves the distribution of user ratings for each episode.
The Fandom scraper has no inputs. The IMDb scraper takes in the list of episode URLs stored in scrapers/imdb/imdb_episode_urls.txt.
The directory of transcript JSON files is saved to data/transcripts/, split by season. The IMDb ratings are similarly saved to data/imdb_episode_ratings.json.
Given the scope of this project, I wanted to speed up my labeling process. I had Cursor generate a small labeling app using Flask and DuckDB so I could keep track of funny and sad ratings for each labeled episode. The app included a function to return a JSON file with the list of labels.
I labeled funny and sad ratings for scenes in 15 episodes. I picked a range of episodes based on their tone, and specifically tried to include some of the more famous, gut-wrenching episodes (such as Season 3, Episode 14 where a recurring character suddenly dies) and some of the lighter, sillier episodes (like Season 4, Episode 7 where the main character's brother spends a lot of time drinking beer in the bathtub). While labeling scenes, I often followed along by watching the episode in real time. I do confess, however, that I remembered some episodes well enough to label scenes without needing to rewatch.
The labeling_app/export_labeled_scenes.py script (written by me) merges the funny and sad labels with the text from the scene. It also adds the text from the previous scene and an episode ID based on the scene ID.
The labeling app uses the scraped scenes in data/transcripts/.
The script exporting labeled scenes saves a JSON file to data/labeled_scenes.json with the labels, current/previous scene texts, episode/scene IDs, and scene position in the episode.
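As a rough sketch of what the export step does (not the actual contents of labeling_app/export_labeled_scenes.py), the merge could look like this. The field names and the `_scene_` separator used to recover the episode ID are assumptions for illustration.

```python
def episode_id_from(scene_id):
    """Recover the episode ID from a scene ID (assumed format: '<episode>_scene_<n>')."""
    return scene_id.rsplit("_scene_", 1)[0]


def merge_labels_with_scenes(scenes, labels):
    """Attach funny/sad labels to scenes and add the previous scene's text.

    `scenes` is assumed to be ordered as the scenes appear in the transcripts.
    """
    labels_by_id = {label["scene_id"]: label for label in labels}
    merged = []
    for i, scene in enumerate(scenes):
        label = labels_by_id.get(scene["scene_id"])
        if label is None:
            continue  # scene was never labeled
        episode_id = episode_id_from(scene["scene_id"])
        previous_text = ""
        # Only use the previous scene if it belongs to the same episode.
        if i > 0 and episode_id_from(scenes[i - 1]["scene_id"]) == episode_id:
            previous_text = scenes[i - 1]["text"]
        merged.append({
            "episode_id": episode_id,
            "scene_id": scene["scene_id"],
            "position": scene["position"],
            "text": scene["text"],
            "previous_text": previous_text,
            "funny": label["funny"],
            "sad": label["sad"],
        })
    return merged
```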
As discussed above, I tried two different approaches in this project: predicting funny/sad ratings with a fine-tuned, locally run model, and predicting those same ratings by prompting a large language model.
Based on our experience in our course, I wanted to use a BERT-based model since I could add a classification head (or two) to the pre-trained model. Based on what I read online, DeBERTa was particularly effective for encoding text for classification tasks with relatively little training data, as was my case.
The DeBERTa/fine_tune.ipynb notebook, which I ran on Google Colab, adds two classification heads to the pre-trained model, one each for funny and sad ratings. For each scene, I encode the text from that scene and the previous scene, and then concatenate both, along with the difference between the two. I also concatenate the scene position scalar. As such, my fine-tuned model ends up with
I tried a combination of model parameters:
- DeBERTa-v3-small vs. the larger DeBERTa-v3-base
- Adam optimizer with weight decay (AdamW) vs. stochastic gradient descent (SGD)
- Applying dropout vs. not
- Using the returned classifier (CLS) token as the scene encoding vs. taking the mean across the sequence of returned tokens
For each combination of the settings above, I trained for 25 epochs and kept the checkpoint with the best accuracy on the validation set. I then used these checkpoints for inference, and saved the resulting predictions for the model evaluation step.
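To make the two pooling options and the concatenated input to the classification heads more concrete, here is a small framework-free sketch using plain Python lists in place of tensors. The function names are hypothetical, the direction of the difference (current minus previous) is my assumption, and a real implementation would also need to mask padding tokens before averaging.

```python
def cls_pool(token_vectors):
    """Use the first token's vector (the CLS position) as the scene encoding."""
    return list(token_vectors[0])


def mean_pool(token_vectors):
    """Average the vectors of all returned tokens (padding is ignored in this sketch)."""
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]


def build_features(current, previous, position):
    """Concatenate current encoding, previous encoding, their difference, and position.

    With a hidden size of H, the result has 3 * H + 1 entries.
    """
    difference = [c - p for c, p in zip(current, previous)]  # assumed: current - previous
    return list(current) + list(previous) + difference + [position]
```

For example, with a toy hidden size of 2, `build_features([1.0, 2.0], [0.5, 1.0], 0.25)` yields a 7-element vector, which both heads (funny and sad) would then read.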
Using the same labeled scenes I used to fine-tune DeBERTa, the gemini/predict_labeled_scenes.py script uses two-shot prompting to make predictions, based on the following prompt:
You are scoring scenes from the TV show Scrubs on how funny and sad they are on a 1 (least) to 5 (most) scale.
Example 1 — Scene: Jordan: It's Jack's first birthday. I want it to be special. I got a petting zoo for the kids. Cox: How about a Russian Roulette booth? We put bullets in ALL the chambers. That way everyone wins! J.D.: Will there be a piñata? Because I need to know if I should bring my piñata helmet. Jordan: Would you zip it, nerd? The only reason I invited you is because you have your own Spongebob Squarepants costume.
Funny: 4
Sad: 1
Example 2 — Scene: Dr. Cox: Time's up. Carla, would you do it for him, please? J.D.: Why are you telling her? Dr. Cox: Shut up and watch. Dr. Cox: Why does this GOMER got to try and die everyday during my lunch? J.D.: That's a little insensitive. J.D.'s narration: Mistake.
Funny: 2
Sad: 2
Now rate this scene.
Location in episode (0 = beginning, 1 = end): LOCATION
Previous scene: PREVIOUS SCENE TEXT
Scene: SCENE TEXT
Respond in exactly this format:
Funny: <number>
Sad: <number>
The script prompts three different models, Gemini 2.5 Pro, Gemini 2.5 Flash, and Gemini 2.5 Flash-Lite, saving the resulting predictions.
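The sketch below shows one way the per-scene part of the prompt could be filled in and the model's reply parsed. It deliberately leaves out the actual Gemini API call; the function names and the two-decimal formatting of the location are assumptions, not taken from gemini/predict_labeled_scenes.py.

```python
import re

# Per-scene portion of the prompt; the instructions and two examples would precede it.
PROMPT_TAIL = (
    "Now rate this scene.\n"
    "Location in episode (0 = beginning, 1 = end): {location}\n"
    "Previous scene: {previous}\n"
    "Scene: {scene}\n"
    "Respond in exactly this format:\n"
    "Funny: <number>\n"
    "Sad: <number>"
)


def build_prompt_tail(location, previous, scene):
    """Fill in the scene-specific placeholders."""
    return PROMPT_TAIL.format(location=f"{location:.2f}", previous=previous, scene=scene)


def parse_ratings(response):
    """Extract 1-5 ratings from a reply; return None if the format is not followed."""
    funny = re.search(r"Funny:\s*([1-5])\b", response)
    sad = re.search(r"Sad:\s*([1-5])\b", response)
    if funny is None or sad is None:
        return None  # treated as "not classified" downstream
    return {"funny": int(funny.group(1)), "sad": int(sad.group(1))}
```

Returning `None` instead of guessing keeps malformed replies visible, so they can be counted separately rather than silently scored as wrong.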
The fine-tuning notebook and prompting script use the labeled scenes in data/labeled_scenes.json.
The notebook saves the model checkpoints to DeBERTa/models, which I uploaded here, and the prediction JSON files to data/DeBERTa_predictions/labeled_scenes/. The prompting script saves the prediction JSON files to data/gemini_predictions/labeled_scenes/.
The evaluate_models/evaluate.py script iterates over all DeBERTa and Gemini predictions for the labeled scenes, and computes accuracy, precision, recall, and F1 for each class for funny and sad predictions. These metrics are averaged across the classes to create model-specification-level metrics, as shown below:
| Model Specification | Type of Prediction | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| microsoft_deberta_v3_base_AdamW_with_dropout_cls_results | funny | 0.3361 | 0.1468 | 0.2042 | 0.1433 |
| microsoft_deberta_v3_base_AdamW_with_dropout_cls_results | sad | 0.479 | 0.1403 | 0.1849 | 0.1541 |
| microsoft_deberta_v3_base_AdamW_with_dropout_mean_pooling_results | funny | 0.3025 | 0.1404 | 0.1892 | 0.1476 |
| microsoft_deberta_v3_base_AdamW_with_dropout_mean_pooling_results | sad | 0.437 | 0.1539 | 0.1886 | 0.1694 |
| microsoft_deberta_v3_base_AdamW_without_dropout_cls_results | funny | 0.3361 | 0.1272 | 0.211 | 0.1526 |
| microsoft_deberta_v3_base_AdamW_without_dropout_cls_results | sad | 0.437 | 0.1589 | 0.1949 | 0.175 |
| microsoft_deberta_v3_base_AdamW_without_dropout_mean_pooling_results | funny | 0.3361 | 0.2684 | 0.1974 | 0.1144 |
| microsoft_deberta_v3_base_AdamW_without_dropout_mean_pooling_results | sad | 0.563 | 0.3111 | 0.2125 | 0.1664 |
| microsoft_deberta_v3_base_SGD_with_dropout_cls_results | funny | 0.3277 | 0.2966 | 0.2418 | 0.2266 |
| microsoft_deberta_v3_base_SGD_with_dropout_cls_results | sad | 0.5126 | 0.1866 | 0.2194 | 0.1981 |
| microsoft_deberta_v3_base_SGD_with_dropout_mean_pooling_results | funny | 0.3445 | 0.1702 | 0.2023 | 0.1166 |
| microsoft_deberta_v3_base_SGD_with_dropout_mean_pooling_results | sad | 0.5294 | 0.1578 | 0.197 | 0.1489 |
| microsoft_deberta_v3_base_SGD_without_dropout_cls_results | funny | 0.3866 | 0.2347 | 0.2674 | 0.2359 |
| microsoft_deberta_v3_base_SGD_without_dropout_cls_results | sad | 0.395 | 0.1432 | 0.17 | 0.1551 |
| microsoft_deberta_v3_base_SGD_without_dropout_mean_pooling_results | funny | 0.3697 | 0.2721 | 0.2237 | 0.1553 |
| microsoft_deberta_v3_base_SGD_without_dropout_mean_pooling_results | sad | 0.479 | 0.1666 | 0.2039 | 0.1826 |
| microsoft_deberta_v3_small_AdamW_with_dropout_cls_results | funny | 0.2857 | 0.16 | 0.1885 | 0.1627 |
| microsoft_deberta_v3_small_AdamW_with_dropout_cls_results | sad | 0.5546 | 0.2593 | 0.2376 | 0.2237 |
| microsoft_deberta_v3_small_AdamW_with_dropout_mean_pooling_results | funny | 0.3277 | 0.2147 | 0.2152 | 0.1899 |
| microsoft_deberta_v3_small_AdamW_with_dropout_mean_pooling_results | sad | 0.5042 | 0.1165 | 0.1846 | 0.1429 |
| microsoft_deberta_v3_small_AdamW_without_dropout_cls_results | funny | 0.3529 | 0.2147 | 0.2456 | 0.2235 |
| microsoft_deberta_v3_small_AdamW_without_dropout_cls_results | sad | 0.4958 | 0.2178 | 0.2161 | 0.2043 |
| microsoft_deberta_v3_small_AdamW_without_dropout_mean_pooling_results | funny | 0.3193 | 0.1647 | 0.1967 | 0.1449 |
| microsoft_deberta_v3_small_AdamW_without_dropout_mean_pooling_results | sad | 0.5546 | 0.2111 | 0.2062 | 0.1546 |
| microsoft_deberta_v3_small_SGD_with_dropout_cls_results | funny | 0.2857 | 0.1467 | 0.1862 | 0.1598 |
| microsoft_deberta_v3_small_SGD_with_dropout_cls_results | sad | 0.5546 | 0.2068 | 0.2253 | 0.1992 |
| microsoft_deberta_v3_small_SGD_with_dropout_mean_pooling_results | funny | 0.3193 | 0.1007 | 0.1899 | 0.1195 |
| microsoft_deberta_v3_small_SGD_with_dropout_mean_pooling_results | sad | 0.5462 | 0.1092 | 0.2 | 0.1413 |
| microsoft_deberta_v3_small_SGD_without_dropout_cls_results | funny | 0.2857 | 0.1471 | 0.1885 | 0.1599 |
| microsoft_deberta_v3_small_SGD_without_dropout_cls_results | sad | 0.395 | 0.1395 | 0.1637 | 0.1506 |
| microsoft_deberta_v3_small_SGD_without_dropout_mean_pooling_results | funny | 0.3613 | 0.2458 | 0.2188 | 0.1544 |
| microsoft_deberta_v3_small_SGD_without_dropout_mean_pooling_results | sad | 0.5546 | 0.1714 | 0.2094 | 0.1651 |
| gemini-2.5-flash-lite | funny | 0.2887 | 0.3006 | 0.2467 | 0.206 |
| gemini-2.5-flash-lite | sad | 0.4347 | 0.419 | 0.4644 | 0.404 |
| gemini-2.5-flash | funny | 0.1168 | 0.2769 | 0.2107 | 0.1022 |
| gemini-2.5-flash | sad | 0.4485 | 0.3771 | 0.444 | 0.3823 |
| gemini-2.5-pro | funny | 0.1718 | 0.2623 | 0.3017 | 0.1628 |
| gemini-2.5-pro | sad | 0.4296 | 0.3596 | 0.4208 | 0.3602 |
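For reference, the class-averaged (macro) metrics in the table can be computed along these lines. This is a stand-alone sketch rather than the code in evaluate_models/evaluate.py; in particular, treating an undefined precision or recall as 0 and averaging over all five rating classes are assumptions on my part.

```python
def macro_metrics(y_true, y_pred, classes=(1, 2, 3, 4, 5)):
    """Accuracy plus precision, recall, and F1 averaged equally across classes."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # Assumed convention: a metric with a zero denominator counts as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Because every class gets equal weight, a model that predicts one common rating for almost every scene can post decent accuracy while its macro precision, recall, and F1 stay low.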
Among the Gemini models, Flash-Lite appears to have performed the best. It had the best sad metrics (precision of 0.42, recall of 0.46, and F1 of 0.40) and the best funny precision (0.30) and F1 (0.21) of the three. Flash and Pro had higher recall on funny for some classes but much lower accuracy because they defaulted to predicting the same labels repeatedly. Overall, Flash-Lite’s balance made it the best choice for scene-level predictions.
Among the DeBERTa models, the base model trained with SGD without dropout, using the CLS token for encodings, performed best. It had the highest funny accuracy (0.39) and F1 (0.24), and while its sad accuracy (0.40) was not the highest, it was more balanced across rating classes than the other models. In contrast, many of the AdamW and mean pooling models ended up predicting the same class repeatedly, especially for funny predictions. Models with SGD but without dropout were more spread out across the 1-3 rating classes. My guess is that the CLS token approach performed well because scenes are long, so averaging over 100+ token encodings introduces noise that the single CLS encoding avoids.
It's also worth noting that the DeBERTa models as a whole did poorly on funny and sad labels of 4 and 5 (that is, those that are more funny or more sad), largely because the labeled data had relatively few scenes at those extremes. For example, in Season 3, Episode 5, the hospital lawyer Ted loses a competition against the chief of medicine's dog Baxter to see who is smarter. The DeBERTa models incorrectly predicted this objectively hilarious scene as being 2/5 funny:
Dr. Kelso: Baxter, speak!
Baxter barks.
Dr. Kelso: Ted, speak!
Lawyer: Hellooooooooo!
Dr. Kelso: Baxter, left foot!
The dog raises its left paw.
Dr. Kelso: Ted, left hand!
Ted reflexively raises his right hand.
Elliot: Left hand, Ted.
Lawyer: Hellooooooooooo!
Dr. Kelso: Baxter wins!
In the same episode, in a scene I labeled as having a sad rating of 4, the main character's brother stands up for him behind his back, after having let him down previously. All the DeBERTa models predicted this scene as having a sadness rating of 1 or 2:
Hey, listen, Dr. Cox: No offense, I'm a big fan of the tough-guy act, but let me tell you what I really think. I think you love the fact that these kids idolize you. Johnny does! Johnny was always the one in the family we knew was going someplace -- sweet kid, smart kid. Becoming a doctor, this is all he ever wanted; and yet, somehow, you've found a way to beat that out of him, haven't you? Turned him into some cynical guy who seems to despise what he does Dr. Cox, Johnny's never gonna look up to me. Ever. But he hangs on your every word. So, I'm askin' -- I'm telling you -- take that responsibility seriously; stop being such a hard-ass, otherwise you're gonna have to answer to me.
The evaluate_models/evaluate.py script takes in the prediction JSONs from data/DeBERTa_predictions/labeled_scenes/ and data/gemini_predictions/labeled_scenes/. Each file contains both the labels and the model's predictions per scene.
A summary table is saved to data/model_evaluation/summary.csv with the metrics discussed above.
Accuracy charts are saved for each model at data/model_evaluation/charts/. Each chart is a histogram displaying the share of predictions that were classified correctly, too high (overrating how funny or sad a scene was), too low (underrating how funny or sad a scene was), or not classified (in the case of Gemini).
Predictions and their corresponding scenes are bucketed into the different types of classifications (correct / too high / too low / not classified), and JSONs are produced for each bucket for each model in data/model_evaluation/examples/.
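The bucketing behind the charts and example files can be sketched as below. The bucket names and the use of `None` for an unparseable Gemini reply are assumptions for illustration.

```python
from collections import Counter


def bucket(label, prediction):
    """Classify one prediction relative to its manual label."""
    if prediction is None:
        return "not_classified"  # e.g. a reply that did not follow the format
    if prediction == label:
        return "correct"
    return "too_high" if prediction > label else "too_low"


def bucket_shares(pairs):
    """Return the share of (label, prediction) pairs that fall in each bucket."""
    counts = Counter(bucket(label, prediction) for label, prediction in pairs)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}
```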
After evaluation I chose the best-performing model from each approach (DeBERTa-v3-base with SGD, no dropout, and CLS-token encodings, and Gemini 2.5 Flash-Lite) and ran them on all scraped transcripts so I could aggregate predicted funny/sad scores at the episode level for IMDb rating analysis.
The DeBERTa/predict_all_scenes_with_transcripts.py script loads the fine-tuned checkpoint and existing predictions for the labeled scenes. It then iterates over every scene in the transcripts folder (skipping scenes that already have predictions), runs the same encoding and classification process as in the fine-tuning notebook, and saves one JSON file with all scenes and their predicted funny and sad labels.
The gemini/predict_all_scenes_with_transcripts.py script uses the same two-shot prompt previously used on the labeled scenes, with the Gemini 2.5 Flash-Lite model. It loads all scenes from the transcript directory, merges in existing predictions, skips any scenes that already have predictions, and prompts the model for the remaining scenes. It saves one JSON file with all scenes and their predicted funny and sad ratings.
Both scripts use the transcripts from data/transcripts/. DeBERTa uses existing predictions from data/DeBERTa_predictions/labeled_scenes/. Gemini uses existing predictions from data/gemini_predictions/labeled_scenes/.
DeBERTa predictions for all scenes are saved to data/DeBERTa_predictions/all_scenes_with_transcripts/. Gemini predictions for all scenes are saved to data/gemini_predictions/all_scenes_with_transcripts/.
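Both inference scripts follow the same "reuse what exists, predict the rest" pattern, which could be sketched like this. `predict_fn` is a hypothetical stand-in for either the DeBERTa forward pass or the Gemini call, and the dictionary layout is assumed.

```python
def predict_remaining(scenes, existing_predictions, predict_fn):
    """Keep existing predictions and only call the model for scenes without one."""
    results = dict(existing_predictions)  # copy so the input is not mutated
    for scene in scenes:
        scene_id = scene["scene_id"]
        if scene_id in results:
            continue  # already predicted during the labeled-scenes step
        results[scene_id] = predict_fn(scene)
    return results
```

Skipping already-predicted scenes keeps the labeled-scene predictions identical to the ones that were evaluated, and avoids paying for the same API calls twice.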
The helper functions in model_IMDb_episode_ratings/helper_functions.py load the manual labels for ratings in 15 episodes, as well as DeBERTa predictions for ratings in 125 episodes. Next, the mean value and variance for funny and sad ratings are calculated across the scenes in each episode. We also compute what I call funny and sad change metrics: the mean absolute change in funny/sad ratings between consecutive scenes in an episode.
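As a sketch of these episode-level aggregations (not the actual helper code), the three metrics for one rating type could be computed as follows. Using the population variance is an assumption; the helpers may use the sample variance instead.

```python
from statistics import mean, pvariance


def episode_features(ratings):
    """Mean, variance, and mean absolute scene-to-scene change for one episode.

    `ratings` is the ordered list of funny (or sad) ratings for the episode's scenes.
    """
    changes = [abs(later - earlier) for earlier, later in zip(ratings, ratings[1:])]
    return {
        "mean": mean(ratings),
        "var": pvariance(ratings),  # assumed: population variance
        "change": mean(changes) if changes else 0.0,
    }
```

The variance and change metrics capture different things: `[1, 1, 4, 4]` and `[1, 4, 1, 4]` have the same mean (2.5) and variance (2.25), but their change values are 1 and 3 respectively, since the first has one sustained shift and the second keeps flipping.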
The helper functions also load the scraped distribution of episode ratings for each episode provided by IMDb users. From there, we calculate the mean and variance of ratings, as well as the share of ratings that are 1/10 (lowest rating) and 10/10 (highest rating).
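A minimal sketch of these IMDb-side metrics is shown below. It assumes the scraped distribution is a mapping from rating (1 to 10, possibly as string keys after a JSON round trip) to vote counts; the real file layout may differ.

```python
def imdb_distribution_metrics(vote_counts):
    """Summarize one episode's rating histogram.

    `vote_counts` maps each rating (1-10) to the number of users who gave it.
    """
    counts = {int(rating): n for rating, n in vote_counts.items()}
    total = sum(counts.values())
    rating_mean = sum(rating * n for rating, n in counts.items()) / total
    rating_variance = sum(n * (rating - rating_mean) ** 2 for rating, n in counts.items()) / total
    return {
        "imdb_rating_mean": rating_mean,
        "imdb_rating_variance": rating_variance,
        "imdb_rating_share_1": counts.get(1, 0) / total,
        "imdb_rating_share_10": counts.get(10, 0) / total,
    }
```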
Finally, these two sets of metrics are combined for a given episode.
The model_IMDb_episode_ratings/analysis.ipynb notebook uses the metrics described above to build dataframes with each episode represented as a row. It then runs correlations and OLS regressions to see how well the funny/sad labels and predictions explain IMDb user ratings of episodes, using the season number and season opener/finale binaries as potential control variables. For the OLS regressions, I run different combinations of input features to see which models make the most sense.
Since I only labeled 15 episodes, I just looked at correlations between the label-derived features and IMDb variables instead of running OLS regressions. The variance in sadness has the strongest correlation with both the IMDb mean (0.77) and the share of users who gave the episode a perfect rating (0.69). This suggests that Scrubs' peaks are driven by emotional depth and range, even if humor carries the show most of the time. I would've expected the change in sadness to have a stronger correlation with IMDb mean ratings and the share of perfect ratings, but thinking back on the episodes I labeled, some of the most emotional moments lasted more than 150 words, which is the cap I use to define an individual scene. With that in mind, it makes sense that the variance in sadness would correlate more strongly with acclaimed episodes' ratings. IMDb rating variance and the share of 1/10 IMDb ratings are highly correlated with one another (0.92), and neither has a strong relationship with the sadness mean or variance. Mean funny ratings are actually negatively correlated with the IMDb mean (-0.56) and positively correlated with the share of 1/10 IMDb ratings (0.55), but I imagine this is due to my experience with the show and preference for more niche, character-based humor. IMDb users who haven't seen the show as many times might not pick up on the same jokes. Similarly, I might have undervalued some of the slapstick, shock-value humor that was more common in the early 2000s and loses appeal with multiple watches.
| | funny_mean | funny_var | funny_change | sad_mean | sad_var | sad_change | imdb_rating_mean | imdb_rating_variance | imdb_rating_share_1 | imdb_rating_share_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| funny_mean | 1.000000 | 0.215088 | 0.540956 | -0.547356 | -0.604593 | -0.053846 | -0.563731 | 0.476696 | 0.546150 | -0.432608 |
| funny_var | 0.215088 | 1.000000 | 0.345746 | -0.156704 | 0.001333 | -0.085572 | 0.025670 | -0.172229 | -0.088742 | 0.022432 |
| funny_change | 0.540956 | 0.345746 | 1.000000 | -0.417797 | -0.365835 | -0.227030 | -0.483972 | 0.018353 | 0.138673 | -0.358063 |
| sad_mean | -0.547356 | -0.156704 | -0.417797 | 1.000000 | 0.821556 | 0.755440 | 0.531915 | 0.072685 | 0.022193 | 0.478064 |
| sad_var | -0.604593 | 0.001333 | -0.365835 | 0.821556 | 1.000000 | 0.436923 | 0.767410 | 0.043925 | -0.007594 | 0.691950 |
| sad_change | -0.053846 | -0.085572 | -0.227030 | 0.755440 | 0.436923 | 1.000000 | 0.243224 | 0.459907 | 0.441086 | 0.301661 |
| imdb_rating_mean | -0.563731 | 0.025670 | -0.483972 | 0.531915 | 0.767410 | 0.243224 | 1.000000 | 0.181802 | 0.105562 | 0.954352 |
| imdb_rating_variance | 0.476696 | -0.172229 | 0.018353 | 0.072685 | 0.043925 | 0.459907 | 0.181802 | 1.000000 | 0.922897 | 0.370464 |
| imdb_rating_share_1 | 0.546150 | -0.088742 | 0.138673 | 0.022193 | -0.007594 | 0.441086 | 0.105562 | 0.922897 | 1.000000 | 0.321885 |
| imdb_rating_share_10 | -0.432608 | 0.022432 | -0.358063 | 0.478064 | 0.691950 | 0.301661 | 0.954352 | 0.370464 | 0.321885 | 1.000000 |
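Each cell in the table above is a pairwise correlation between two per-episode columns. Assuming these are Pearson correlations (the default in pandas' `DataFrame.corr()`), a single cell can be reproduced with a stand-alone function like this one; the function name is my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of per-episode values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a column is constant")
    return covariance / (spread_x * spread_y)
```

With only 15 labeled episodes, each of these coefficients rests on 15 points, so they are best read as directional hints rather than precise estimates.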
As we saw with the data that I labeled, the regression results using our Gemini-generated ratings tell a pretty consistent story across different models. Sadness variance is the only feature that consistently appears as a predictor of IMDb mean rating and the share of IMDb ratings of 10, significant at the .01 level. In our model predicting the mean IMDb rating with funny/sad variances and without controls, a one-unit increase in sadness variance is associated with a 0.35 point increase in IMDb mean rating, holding funny variance constant.
The results around IMDb rating polarization are a little more interesting. Both the funny mean and sad mean features have significant positive associations with IMDb rating variance and the share of IMDb users who gave 1/10 ratings. In the simple model predicting IMDb rating variance with funny/sad means without controls, a one-unit increase in funny mean is associated with a 1.09 point increase in IMDb rating variance, and a one-unit increase in sad mean is associated with a 0.78 point increase, both significant at the .01 level. Thinking about these results, it seems that some IMDb users enjoy when episodes are more intense and/or funny, while others prefer a more balanced pace to the episode.
Again, my oscillation variables (the funny and sad change metrics) do not perform well as predictors. Scene-to-scene funniness or sadness switching isn't significantly associated with any dependent variable. I think the variance does a better job of measuring the "gut punch" feeling, for the scene identification reasons discussed above.
Also, it's worth noting that even the best-performing models explain only about 18% of the variation in the IMDb dependent variables.
Here are the two most informative models:
OLS Regression Results
==============================================================================
Dep. Variable: imdb_rating_mean R-squared: 0.182
Model: OLS Adj. R-squared: 0.169
Method: Least Squares F-statistic: 13.57
Date: Tue, 10 Mar 2026 Prob (F-statistic): 4.77e-06
Time: 21:31:19 Log-Likelihood: -25.948
No. Observations: 125 AIC: 57.90
Df Residuals: 122 BIC: 66.38
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 7.5447 0.104 72.296 0.000 7.338 7.751
funny_var 0.2201 0.151 1.458 0.147 -0.079 0.519
sad_var 0.3454 0.090 3.849 0.000 0.168 0.523
==============================================================================
Omnibus: 33.970 Durbin-Watson: 1.564
Prob(Omnibus): 0.000 Jarque-Bera (JB): 54.577
Skew: 1.296 Prob(JB): 1.41e-12
Kurtosis: 4.939 Cond. No. 9.80
==============================================================================
================================================================================
Dep. Variable: imdb_rating_variance R-squared: 0.177
Model: OLS Adj. R-squared: 0.163
Method: Least Squares F-statistic: 13.08
Date: Tue, 10 Mar 2026 Prob (F-statistic): 7.16e-06
Time: 21:31:19 Log-Likelihood: -75.821
No. Observations: 125 AIC: 157.6
Df Residuals: 122 BIC: 166.1
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1.8315 1.050 -1.745 0.084 -3.909 0.247
funny_mean 1.0935 0.259 4.216 0.000 0.580 1.607
sad_mean 0.7809 0.161 4.844 0.000 0.462 1.100
==============================================================================
Omnibus: 47.628 Durbin-Watson: 1.261
Prob(Omnibus): 0.000 Jarque-Bera (JB): 191.272
Skew: 1.276 Prob(JB): 2.92e-42
Kurtosis: 8.497 Cond. No. 106.
==============================================================================
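To make the coefficient interpretation concrete, the snippet below plugs the reported coefficients from the first model (IMDb mean rating on funny and sad variance) into a prediction function. The coefficients are copied from the output above; the function name and the example inputs are made up for illustration.

```python
# Coefficients reported in the first OLS summary above.
INTERCEPT = 7.5447
COEF_FUNNY_VAR = 0.2201
COEF_SAD_VAR = 0.3454


def predicted_imdb_mean(funny_var, sad_var):
    """Fitted IMDb mean rating for an episode with the given rating variances."""
    return INTERCEPT + COEF_FUNNY_VAR * funny_var + COEF_SAD_VAR * sad_var


# Hypothetical episodes that differ only in sadness variance.
steady = predicted_imdb_mean(funny_var=0.5, sad_var=1.0)
gut_punch = predicted_imdb_mean(funny_var=0.5, sad_var=2.0)
print(round(gut_punch - steady, 4))  # the sad_var coefficient
```

Holding funny variance fixed, each extra unit of sadness variance moves the fitted rating by the sad_var coefficient, which is the "0.35 point increase" described earlier. Given the R-squared of 0.182, individual episodes can still land far from these fitted values.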
Unfortunately, the results using DeBERTa predictions were pretty bleak. Most of our predictors had high p-values across all models, and the few models that did produce results significant at the .05 level came with very large standard errors, suggesting they were not meaningful. For example, in the model predicting mean IMDb rating with funny and sad variance, sad variance had a coefficient of 3.43, nearly ten times the 0.35 we saw in the equivalent Gemini model, but with a standard error of 1.35. The results from the DeBERTa model are also largely inconsistent with the Gemini results. For example, the DeBERTa model predicting IMDb rating variance with funny and sad means shows funny mean negatively associated with rating variance, the opposite of what Gemini found. These factors, particularly the high p-values, suggest that the DeBERTa model performed worse than the Gemini model at predicting how funny and sad scenes are.
The notebook and helper use data/imdb_episode_ratings.json, data/labeled_scenes.json, and the prediction JSONs in data/gemini_predictions/all_scenes_with_transcripts/ and data/DeBERTa_predictions/all_scenes_with_transcripts/.
All results are produced in the notebook itself.
