Build a binary classifier to identify high-probability exoplanet candidates.
- NASA Open Data Portal: https://data.nasa.gov/dataset/kepler
- Kepler Exoplanet Search Results: https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results/data
- Kepler Exoplanet Dataset: https://www.kaggle.com/datasets/gauravkumar2525/kepler-exoplanet-dataset/data
We choose the last dataset, since it appears to be the most usable.
Columns
- kepid – Unique identifier for the host star.
- kepoi_name – Unique identifier for the planetary candidate.
- koi_disposition – Status of the exoplanet candidate (converted to numerical values):
"CANDIDATE" -> 1 (Potential exoplanet)
"CONFIRMED" -> 2 (Verified exoplanet)
"FALSE POSITIVE" -> 0 (Not a real exoplanet)
- koi_score – Confidence score for the planetary classification (higher values indicate stronger confidence).
- koi_period – Orbital period of the planet (in days).
- koi_prad – Estimated planetary radius (in Earth radii).
- koi_teq – Estimated equilibrium temperature of the planet (Kelvin).
- koi_insol – Insolation flux received by the planet (relative to Earth's insolation).
- koi_steff – Effective temperature of the host star (Kelvin).
- koi_srad – Stellar radius (in solar radii).
- koi_slogg – Surface gravity of the host star (logarithmic scale, in cm/s²).
- koi_kepmag – Kepler-band magnitude (brightness of the star as observed by Kepler).
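The disposition encoding above can be reproduced in pandas. A minimal sketch, assuming the raw CSV stores the labels as strings (the dataset we picked may already ship numeric codes):

```python
import pandas as pd

# Hypothetical example: map string dispositions to the numeric codes listed above.
df = pd.DataFrame({"koi_disposition": ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]})
mapping = {"FALSE POSITIVE": 0, "CANDIDATE": 1, "CONFIRMED": 2}
df["koi_disposition"] = df["koi_disposition"].map(mapping)
print(df["koi_disposition"].tolist())  # → [1, 2, 0]
```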
This project is interesting because the data is imbalanced: there are far more "False Positive" entries than actual "Confirmed" planets.
Secondly, accuracy is a poor metric for this dataset. With this class imbalance, a model that always predicts "False Positive" would still score roughly 64% accuracy on the binary task while detecting zero planets.
We will aim for high precision instead, so the model predicts "Confirmed" only when the probability is high. Since investigating a candidate around another star is expensive, we would rather miss a real planet than waste money investigating a rock.
The data is also heavily skewed, since very large and massive bodies are rare. We will use strategies such as a log1p transform and careful scaling to tame these skewed distributions.
Taking a look at the first few rows of the data:
kepid kepoi_name koi_disposition koi_score koi_period koi_prad \
0 10797460 K00752.01 2 1.000 9.488036 2.26
1 10797460 K00752.02 2 0.969 54.418383 2.83
2 10811496 K00753.01 1 0.000 19.899140 14.60
3 10848459 K00754.01 0 0.000 1.736952 33.46
4 10854555 K00755.01 2 1.000 2.525592 2.75
koi_teq koi_insol koi_steff koi_srad koi_slogg koi_kepmag
0 793.0 93.59 5455.0 0.927 4.467 15.347
1 443.0 9.11 5455.0 0.927 4.467 15.347
2 638.0 39.30 5853.0 0.868 4.544 15.436
3 1395.0 891.96 5805.0 0.791 4.564 15.597
4 1406.0 926.16 6031.0 1.046 4.438 15.509

Out of these, koi_score is itself the output of NASA's own vetting/ML process, and we must not include it in our model's training. Otherwise the model could learn to copy NASA's score rather than evaluating the actual planetary data. This is called data leakage, and the fix is to drop the feature altogether.
Similarly, kepid and kepoi_name are identifiers, not physical measurements, and are irrelevant to the classification task. They can be safely dropped.
Finally, koi_disposition has three classes, but we want a binary yes/no target. The CANDIDATE rows are ambiguous, and training on ambiguous data may confuse the algorithm, so we drop those rows and keep only CONFIRMED and FALSE POSITIVE.
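The cleanup described above can be sketched as a small helper. This is a hypothetical function (the notebook's actual cell may differ); it assumes the numeric codes 0/1/2 from earlier and re-encodes CONFIRMED as 1 so the target becomes a clean binary label:

```python
import pandas as pd

def clean_kepler(df: pd.DataFrame) -> pd.DataFrame:
    # Drop identifiers and the leaky NASA-made score.
    out = df.drop(columns=["kepid", "kepoi_name", "koi_score"])
    # Drop the ambiguous CANDIDATE class (coded 1 in the raw data).
    out = out[out["koi_disposition"] != 1].copy()
    # Re-encode: CONFIRMED (2) -> 1, FALSE POSITIVE stays 0.
    out["koi_disposition"] = (out["koi_disposition"] == 2).astype(int)
    return out
```

After this, koi_disposition holds the 0/1 labels used in the class-ratio printouts later on.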
Then, we can see that all columns are non-null and contain the full 9564 values, which is great: no missing data.
<class 'pandas.DataFrame'>
RangeIndex: 9564 entries, 0 to 9563
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 kepid 9564 non-null int64
1 kepoi_name 9564 non-null str
2 koi_disposition 9564 non-null int64
3 koi_score 9564 non-null float64
4 koi_period 9564 non-null float64
5 koi_prad 9564 non-null float64
6 koi_teq 9564 non-null float64
7 koi_insol 9564 non-null float64
8 koi_steff 9564 non-null float64
9 koi_srad 9564 non-null float64
10 koi_slogg 9564 non-null float64
11 koi_kepmag 9564 non-null float64
dtypes: float64(9), int64(2), str(1)
memory usage: 896.8 KB
Finally, there are outliers in the dataset. Some are genuine, as these objects are out in space, but some might be telescope glitches or measurement errors.
kepid koi_disposition koi_score koi_period \
count 9.564000e+03 9564.000000 9564.000000 9564.000000
mean 7.690628e+06 0.780845 0.480829 75.671358
std 2.653459e+06 0.863026 0.437658 1334.744046
min 7.574500e+05 0.000000 0.000000 0.241843
25% 5.556034e+06 0.000000 0.000000 2.733684
50% 7.906892e+06 0.000000 0.480829 9.752831
75% 9.873066e+06 2.000000 0.995000 40.715178
max 1.293514e+07 2.000000 1.000000 129995.778400
koi_prad koi_teq koi_insol koi_steff koi_srad \
count 9564.000000 9564.000000 9.564000e+03 9564.00000 9564.000000
mean 102.891778 1085.385828 7.745737e+03 5706.82328 1.728712
std 3018.662296 839.940895 1.565099e+05 781.58775 6.009769
min 0.080000 25.000000 0.000000e+00 2661.00000 0.109000
25% 1.430000 553.000000 2.216000e+01 5333.00000 0.835750
50% 2.490000 906.000000 1.583200e+02 5745.00000 1.006500
75% 21.712500 1352.500000 1.110257e+03 6099.00000 1.435250
max 200346.000000 14667.000000 1.094755e+07 15896.00000 229.908000
koi_slogg koi_kepmag
count 9564.000000 9564.000000
mean 4.310157 14.264606
std 0.424316 1.385376
min 0.047000 6.966000
25% 4.232750 13.440000
50% 4.432000 14.520000
75% 4.539000 15.322000
max 5.364000 20.003000

The scales in this dataset are absolutely massive. For koi_period, the 75th percentile is basically 40 days, but the maximum is 129,995 days, wild! Same goes for koi_prad, the radius: the 75th percentile is just about 21 Earth radii, but the maximum is 200,346 Earth radii, which is almost certainly erroneous. We need to transform or cap these values, using a scaler that handles outliers well.
Then, after dropping the unneeded columns and the CANDIDATE category, we can plot a histogram of the dataset.
There are a lot of things that can be observed from these plots.
First of all, no feature is neatly centered, i.e. perfect for training a model on. Almost all distributions are right-skewed: the bulk of the points sit at small values on the left, with a long tail of large values stretching right. koi_kepmag (Kepler-band magnitude) is roughly symmetrically distributed around its center, but the rest (koi_prad, koi_insol, koi_srad and koi_period) are not. Nearly all (~99%) of the data points are small in these columns, and only a few are large, or very large. These are heavy-tailed distributions. We cannot use a simple StandardScaler on them, because the extreme values would dominate the mean and variance and ruin the scaling for everything else. These columns need a mathematical transformation before scaling.
The class 0 data points are almost double the class 1 points in koi_disposition (the target feature): roughly ~5000 vs ~2500. This is called Class Imbalance. Simply using train_test_split may leave the classes in different proportions across the splits, so we use StratifiedShuffleSplit to preserve the same ~2:1 ratio in both training and test sets. The danger is that a dumb model always predicting class 0 would be correct ~64% of the time (accuracy) without being useful at all. That's why we will look at the F1 Score and the Confusion Matrix instead.
Other features like koi_steff, koi_kepmag and koi_teq are much closer to standard bell curves, slightly skewed in one direction, but fine: StandardScaler will handle them.
The most common fix for this kind of heavy tail in data science is a logarithmic transform (np.log1p()). It shrinks massive outliers while spreading out small, tightly clustered values.

skewed_cols = ['koi_period', 'koi_prad', 'koi_insol', 'koi_srad']
for col in skewed_cols:
    df[col] = np.log1p(df[col])

The class imbalance is easily handled by maintaining the class ratio in the splits as well, using the basic StratifiedShuffleSplit.
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=38)
for train_index, test_index in split.split(df, df['koi_disposition']):
    df_train = df.iloc[train_index]
    df_test = df.iloc[test_index]

print(f"Original Ratio: {df['koi_disposition'].value_counts() / len(df)}")
print(f"Train Ratio: {df_train['koi_disposition'].value_counts() / len(df_train)}")
print(f"Test Ratio: {df_test['koi_disposition'].value_counts() / len(df_test)}")

Original Ratio: koi_disposition
0 0.638222
1 0.361778
Name: count, dtype: float64
Train Ratio: koi_disposition
0 0.638252
1 0.361748
Name: count, dtype: float64
Test Ratio: koi_disposition
0 0.638102
1 0.361898
Name: count, dtype: float64

We should perform the train/test split as early as possible, and here we do it right after our EDA. The goal is to hide the test set from the model as well as from our own eyes, so that we do not introduce any bias toward the test data.
After both, we have this histogram and correlation analysis.
koi_disposition 1.000000
koi_slogg 0.180683
koi_kepmag 0.057630
koi_period 0.057038
koi_srad -0.198354
koi_steff -0.220260
koi_insol -0.264089
koi_teq -0.272284
koi_prad -0.434116
Name: koi_disposition, dtype: float64

The correlations tell us that koi_prad is the most informative feature: it is strongly negatively correlated with an object being a real planet. The rest are moderately informative, while koi_kepmag and koi_period have almost no linear relationship with the target. koi_slogg has a slight positive correlation.
Looks pretty nice, doesn't it?
The very first thing we need to do is separate our predictors (features) from the labels.

kepler_train = df_train.drop("koi_disposition", axis=1)
kepler_labels = df_train["koi_disposition"].copy()

An imputer fills in missing values of a dataset, for example with the median of each column; sklearn's SimpleImputer handles this.
Proper scaling is another important step. koi_steff goes up to 15896, but koi_slogg maxes out at just 5.36. Many ML algorithms implicitly treat larger numbers as more important, so we use something like a StandardScaler, which shifts each feature to mean 0 and standard deviation 1.
Now, doing these tasks one by one in a single notebook or Python script is fine once, but painful to repeat: we end up copy-pasting bits around. This is where sklearn's pipelines come into play. They apply the predefined transformation steps with a single .fit_transform(...) call.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# This pipeline executes our cleaning steps in order
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),  # Fill missing values
    ('std_scaler', StandardScaler()),               # Scale the numbers
])

kepler_prepared = num_pipeline.fit_transform(kepler_train)
print("Data shape after preparation:", kepler_prepared.shape)
print("First row of prepared data:\n", kepler_prepared[0])

After applying the transformation, we get back a NumPy ndarray: superb for speedy calculations and model training, but incompatible with human eyes. To see a histogram of the scaled data, we need to convert it back to a DataFrame.
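A minimal sketch of that conversion, with toy data standing in for kepler_train so the snippet runs standalone:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for kepler_train; the real frame has all the koi_* columns.
kepler_train = pd.DataFrame({
    "koi_period": [9.5, 54.4, 19.9, 1.7],
    "koi_prad": [2.26, 2.83, 14.6, np.nan],  # one gap for the imputer to fill
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
kepler_prepared = num_pipeline.fit_transform(kepler_train)

# Wrap the ndarray back into a DataFrame, reusing the original column names
# and index, so .hist(), .describe() etc. work again.
prepared_df = pd.DataFrame(
    kepler_prepared, columns=kepler_train.columns, index=kepler_train.index
)
print(prepared_df.mean().round(6))  # every column centred at ~0
```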
Mostly looks the same, but data is rescaled now. The mean is 0, and standard deviation is 1. Now we can move onto the training part
Instead of starting with a very strong model and calling it a day, we will begin with a baseline: an SGD (Stochastic Gradient Descent) classifier, rather than something powerful like a Random Forest.
from sklearn.linear_model import SGDClassifier

# X_train / y_train are the prepared training features and labels from above.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)

With this much class imbalance, accuracy would be a terrible evaluation metric: the model might learn to always predict "false positive" and still be correct roughly 64% of the time. So we will use Cross-Validation: split the training set into 3 chunks, train on 2, and predict on the 1 it hasn't seen yet.
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, kepler_prepared, kepler_labels, cv=3)

And then we analyze the core metrics.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

print("Confusion Matrix:\n", confusion_matrix(y_train, y_train_pred))
print(f"Precision: {precision_score(y_train, y_train_pred) * 100:.2f}%")
print(f"Recall: {recall_score(y_train, y_train_pred) * 100:.2f}%")
print(f"F1 Score: {f1_score(y_train, y_train_pred) * 100:.2f}%")

Which gives, for a random_state of 32198749...
Confusion Matrix:
[[3065 806]
[ 404 1790]]
Precision: 68.95%
Recall: 81.59%
F1 Score: 74.74%

Breaking down the confusion matrix:
- True Negatives: 3065 -> correctly ignored 3065 false positives.
- False Positives: 806 -> incorrectly flagged 806 non-planets as confirmed. This is why precision is 68.95%: roughly 3 in 10 of the model's "confirmed" predictions are wrong.
- False Negatives: 404 -> incorrectly rejected 404 real planets. That is why recall is 81.59%: the model found about 81% of all real planets.
- True Positives: 1790 -> correctly confirmed 1790 real planets.
Since we cannot have both high precision and high recall, we must choose what matters most for this problem. Follow-up observations, say with the James Webb Space Telescope, get very expensive, so we cannot waste money pointing at 806 fake planets. Therefore, for Kepler-Zero, we want High Precision.
Now we can try to further improve the model by tweaking the Decision Threshold of the SGD classifier.
Under the hood, the model computes a score for each planet: if it is > 0, the prediction is Confirmed (1); if < 0, it is rejected (0). We can shift this threshold to better fit our project's goals.
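The threshold shifting can be sketched like this. Toy data stands in for the real training set here (the real code would use kepler_prepared and kepler_labels), so the exact numbers will differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# Toy imbalanced dataset (~64% negatives) standing in for the Kepler features.
X, y = make_classification(n_samples=2000, weights=[0.64], random_state=42)

# Ask for raw decision scores instead of hard 0/1 predictions.
y_scores = cross_val_predict(SGDClassifier(random_state=42), X, y,
                             cv=3, method="decision_function")

# Raising the cut-off above the default 0 trades recall for precision.
for threshold in (0.0, 0.5, 1.0):
    y_pred = y_scores > threshold
    print(f"t={threshold}: precision={precision_score(y, y_pred):.2%} "
          f"recall={recall_score(y, y_pred):.2%}")
```

In general, the higher the threshold, the fewer but more confident the "confirmed" calls become.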
Here, we found a threshold where precision rises above 80%, but the price we paid was recall dropping to 4%, which is pretty bad. This is unfortunately the limit of linear models: the SGD classifier draws a straight line (a hyperplane) to separate the classes, which is not ideal for our problem. So we try a non-linear model: Random Forest, an ensemble of many Decision Trees that can capture complex patterns in the data, which is not possible with a linear model.
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)
forest_clf.fit(X_train, y_train)

And then the evaluation part.
# Cross validated prediction
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3, method="predict_proba")
# Random Forest uses probabilities, take the score of the positive class (1)
y_scores_forest = y_probas_forest[:, 1]
y_train_pred_forest = y_scores_forest >= 0.5
print("--- Random Forest Performance ---")
print("Confusion Matrix:\n", confusion_matrix(y_train, y_train_pred_forest))
print(f"Precision: {precision_score(y_train, y_train_pred_forest) * 100:.2f}%")
print(f"Recall: {recall_score(y_train, y_train_pred_forest) * 100:.2f}%")
print(f"F1 Score: {f1_score(y_train, y_train_pred_forest) * 100:.2f}%")

Which is far better than the basic SGDClassifier:
--- Random Forest Performance ---
Confusion Matrix:
[[3414 457]
[ 359 1835]]
Precision: 80.06%
Recall: 83.64%
F1 Score: 81.81%

The most important features (top 3):
Top 3 Features for Kepler-Zero:
koi_prad 0.327731
koi_period 0.138250
koi_insol 0.131683
dtype: float64

Precision/Recall vs. Threshold Curve
On the curve there is a mark at the 90% precision level, which is our target. The threshold there is 0.77, giving a precision of 90.43% and a recall of 56.01%, which is acceptable. The F1 score is 69.18%.
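The search for that 90%-precision threshold can be sketched with precision_recall_curve. Again, toy data stands in for kepler_prepared / kepler_labels, so the resulting threshold will not match the 0.77 found above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

# Toy imbalanced dataset standing in for the prepared Kepler training set.
X, y = make_classification(n_samples=2000, weights=[0.64], random_state=42)

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
y_probas = cross_val_predict(forest_clf, X, y, cv=3, method="predict_proba")
y_scores = y_probas[:, 1]  # probability of the positive class

precisions, recalls, thresholds = precision_recall_curve(y, y_scores)

target_precision = 0.90
# Index of the first threshold whose precision clears the target.
idx = np.argmax(precisions >= target_precision)
idx = min(idx, len(thresholds) - 1)  # the last precision point has no threshold
threshold_90 = thresholds[idx]
print(f"threshold={threshold_90:.2f} "
      f"precision={precisions[idx]:.2%} recall={recalls[idx]:.2%}")
```

The chosen threshold_90 is what gets saved alongside the model so the API can apply the same cut-off at inference time.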
Around 20% of the data was reserved at the start for testing. After building the pipeline and tuning the Random Forest on the training data, a single final evaluation was performed on this 'unseen' set. The model achieved 91.71% Precision, proving that it is ready for deployment in the Kepler-Zero API.
The final model metrics on the test set (which it has never seen before):
--- Test set performance metrics ---
Test Precision: 91.71%
Test Recall: 62.48%
Test F1 Score: 74.32%

Which is not bad! It's great. It means we can proudly say: "When the system flags a planet as confirmed, there is a 91.71% chance that it is correct." The trade-off is that the model only recovers about 62.48% of the real planets, but at least we are not wasting resources on things that are not planets.
An overall F1 score of 74.32% is great!
Now we need to save the model and the data pipeline, and load them back on the FastAPI server to serve requests. We can use joblib for this task.
We serialize everything needed for inference into a single file:
- model
- preprocessing pipeline
- feature metadata
- threshold + metrics
import joblib

joblib.dump(
    {
        "skewed_cols": skewed_cols,
        "pipeline": num_pipeline,
        "model": forest_clf,
        "precision_target": target_precision,
        "precision_threshold": threshold_90_rf,
        "precision_score": precision_score(y_train, y_train_pred_90_forest),
        "recall_score": recall_score(y_train, y_train_pred_90_forest),
        "f1_score": f1_score(y_train, y_train_pred_90_forest),
    },
    "./api/model/kepler_zero_format.joblib",
)

The API is just a thin wrapper around the saved model.
- load .joblib once
- keep model + pipeline in memory
- validate input (Pydantic)
- apply same preprocessing as training
- run pipeline → model
- apply threshold
- return result
- basic sanity check + model metadata
Request Flow
request → validate → preprocess → pipeline → model → threshold → response
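The load → preprocess → threshold path above can be sketched without FastAPI. This is a toy bundle built inline so the snippet is self-contained; the key names mirror the joblib dictionary saved earlier, and `predict_confirmed` is a hypothetical helper, not the actual API code:

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# --- training side: build and save a tiny stand-in bundle ---
X = pd.DataFrame({"koi_period": [1.0, 2.0, 3.0, 4.0],
                  "koi_prad": [1.1, 2.2, 15.0, 30.0]})
y = [1, 1, 0, 0]
X_log = np.log1p(X)  # training applied log1p to the skewed columns
pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                 ("scaler", StandardScaler())])
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(pipe.fit_transform(X_log), y)
joblib.dump({"skewed_cols": ["koi_period", "koi_prad"], "pipeline": pipe,
             "model": model, "precision_threshold": 0.77}, "bundle.joblib")

# --- inference side: load once, keep in memory, apply same preprocessing ---
bundle = joblib.load("bundle.joblib")

def predict_confirmed(raw: pd.DataFrame) -> np.ndarray:
    df = raw.copy()
    # Same log1p transform the training data received.
    df[bundle["skewed_cols"]] = np.log1p(df[bundle["skewed_cols"]])
    proba = bundle["model"].predict_proba(bundle["pipeline"].transform(df))[:, 1]
    # Apply the saved high-precision threshold instead of the default 0.5.
    return proba >= bundle["precision_threshold"]

print(predict_confirmed(X))
```

The FastAPI handler would wrap `predict_confirmed` with Pydantic validation and return the boolean flag plus the stored metadata.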
- Do feature engineering
- planet_star_ratio = koi_prad / koi_srad
- temp_insol_ratio = koi_teq / (koi_insol + 1)
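The proposed engineered features could be sketched as follows. `add_engineered_features` is a hypothetical helper, not yet part of the pipeline:

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Planet-to-star size ratio: a planet that is huge relative to its star
    # is a classic false-positive signature (e.g. an eclipsing binary).
    out["planet_star_ratio"] = out["koi_prad"] / out["koi_srad"]
    # Equilibrium temperature per unit insolation; the +1 guards against
    # division by zero (koi_insol has a minimum of 0 in the data).
    out["temp_insol_ratio"] = out["koi_teq"] / (out["koi_insol"] + 1)
    return out

df = pd.DataFrame({"koi_prad": [2.26, 14.6], "koi_srad": [0.927, 0.868],
                   "koi_teq": [793.0, 638.0], "koi_insol": [93.59, 39.3]})
print(add_engineered_features(df)[["planet_star_ratio", "temp_insol_ratio"]])
```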