Build a binary classifier to identify high-probability exoplanet candidates.
- NASA Open Data Portal: https://data.nasa.gov/dataset/kepler
- Kepler Exoplanet Search Results: https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results/data
- Kepler Exoplanet Dataset: https://www.kaggle.com/datasets/gauravkumar2525/kepler-exoplanet-dataset/data
We choose the last dataset, since it appears to be the most usable.
Columns
- kepid – Unique identifier for the host star.
- kepoi_name – Unique identifier for the planetary candidate.
- koi_disposition – Status of the exoplanet candidate (converted to numerical values):
"CANDIDATE" -> 1 (Potential exoplanet)
"CONFIRMED" -> 2 (Verified exoplanet)
"FALSE POSITIVE" -> 0 (Not a real exoplanet)
- koi_score – Confidence score for the planetary classification (higher values indicate stronger confidence).
- koi_period – Orbital period of the planet (in days).
- koi_prad – Estimated planetary radius (in Earth radii).
- koi_teq – Estimated equilibrium temperature of the planet (Kelvin).
- koi_insol – Insolation flux received by the planet (relative to Earth's insolation).
- koi_steff – Effective temperature of the host star (Kelvin).
- koi_srad – Stellar radius (in solar radii).
- koi_slogg – Surface gravity of the host star (logarithmic scale, in cm/s²).
- koi_kepmag – Kepler-band magnitude (brightness of the star as observed by Kepler).
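The disposition encoding above can be reproduced in pandas. A minimal sketch, assuming the raw CSV stores the labels as strings (the dataset we picked may already ship numeric codes):

```python
import pandas as pd

# Hypothetical example: map string dispositions to the numeric codes listed above.
df = pd.DataFrame({"koi_disposition": ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]})
mapping = {"FALSE POSITIVE": 0, "CANDIDATE": 1, "CONFIRMED": 2}
df["koi_disposition"] = df["koi_disposition"].map(mapping)
print(df["koi_disposition"].tolist())  # → [1, 2, 0]
```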
This project is interesting because the data is imbalanced: there are far more "False Positive" entries than actual "Confirmed" planets.
Secondly, accuracy is a poor metric for this dataset. With this class imbalance, a model that always predicts "False Positive" would still score roughly 64% accuracy on the binary task while detecting zero planets.
We will aim for high precision instead, so the model predicts "Confirmed" only when the probability is high. Since investigating a candidate around another star is expensive, we would rather miss a real planet than waste money investigating a rock.
The data is also heavily skewed, since very large and massive bodies are rare. We will use strategies such as a log1p transform and careful scaling to tame these skewed distributions.
Taking a look at the first few rows of the data:
kepid kepoi_name koi_disposition koi_score koi_period koi_prad \
0 10797460 K00752.01 2 1.000 9.488036 2.26
1 10797460 K00752.02 2 0.969 54.418383 2.83
2 10811496 K00753.01 1 0.000 19.899140 14.60
3 10848459 K00754.01 0 0.000 1.736952 33.46
4 10854555 K00755.01 2 1.000 2.525592 2.75
koi_teq koi_insol koi_steff koi_srad koi_slogg koi_kepmag
0 793.0 93.59 5455.0 0.927 4.467 15.347
1 443.0 9.11 5455.0 0.927 4.467 15.347
2 638.0 39.30 5853.0 0.868 4.544 15.436
3 1395.0 891.96 5805.0 0.791 4.564 15.597
4 1406.0 926.16 6031.0 1.046 4.438 15.509

Out of these, koi_score is itself the output of NASA's own vetting/ML process, and we must not include it in our model's training. Otherwise the model could learn to copy NASA's score rather than evaluating the actual planetary data. This is called data leakage, and the fix is to drop the feature altogether.
Similarly, kepid and kepoi_name are identifiers, not physical measurements, and are irrelevant to the classification task. They can be safely dropped.
Finally, koi_disposition has three classes, but we want a binary yes/no target. The CANDIDATE rows are ambiguous, and training on ambiguous data may confuse the algorithm, so we drop those rows and keep only CONFIRMED and FALSE POSITIVE.
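The cleanup described above can be sketched as a small helper. This is a hypothetical function (the notebook's actual cell may differ); it assumes the numeric codes 0/1/2 from earlier and re-encodes CONFIRMED as 1 so the target becomes a clean binary label:

```python
import pandas as pd

def clean_kepler(df: pd.DataFrame) -> pd.DataFrame:
    # Drop identifiers and the leaky NASA-made score.
    out = df.drop(columns=["kepid", "kepoi_name", "koi_score"])
    # Drop the ambiguous CANDIDATE class (coded 1 in the raw data).
    out = out[out["koi_disposition"] != 1].copy()
    # Re-encode: CONFIRMED (2) -> 1, FALSE POSITIVE stays 0.
    out["koi_disposition"] = (out["koi_disposition"] == 2).astype(int)
    return out
```

After this, koi_disposition holds the 0/1 labels used in the class-ratio printouts later on.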
Then, we can see that all columns are non-null and contain the full 9564 values, which is great: no missing data.
<class 'pandas.DataFrame'>
RangeIndex: 9564 entries, 0 to 9563
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 kepid 9564 non-null int64
1 kepoi_name 9564 non-null str
2 koi_disposition 9564 non-null int64
3 koi_score 9564 non-null float64
4 koi_period 9564 non-null float64
5 koi_prad 9564 non-null float64
6 koi_teq 9564 non-null float64
7 koi_insol 9564 non-null float64
8 koi_steff 9564 non-null float64
9 koi_srad 9564 non-null float64
10 koi_slogg 9564 non-null float64
11 koi_kepmag 9564 non-null float64
dtypes: float64(9), int64(2), str(1)
memory usage: 896.8 KB
Finally, there are outliers in the dataset. Some are genuine, as these objects are out in space, but some might be telescope glitches or measurement errors.
kepid koi_disposition koi_score koi_period \
count 9.564000e+03 9564.000000 9564.000000 9564.000000
mean 7.690628e+06 0.780845 0.480829 75.671358
std 2.653459e+06 0.863026 0.437658 1334.744046
min 7.574500e+05 0.000000 0.000000 0.241843
25% 5.556034e+06 0.000000 0.000000 2.733684
50% 7.906892e+06 0.000000 0.480829 9.752831
75% 9.873066e+06 2.000000 0.995000 40.715178
max 1.293514e+07 2.000000 1.000000 129995.778400
koi_prad koi_teq koi_insol koi_steff koi_srad \
count 9564.000000 9564.000000 9.564000e+03 9564.00000 9564.000000
mean 102.891778 1085.385828 7.745737e+03 5706.82328 1.728712
std 3018.662296 839.940895 1.565099e+05 781.58775 6.009769
min 0.080000 25.000000 0.000000e+00 2661.00000 0.109000
25% 1.430000 553.000000 2.216000e+01 5333.00000 0.835750
50% 2.490000 906.000000 1.583200e+02 5745.00000 1.006500
75% 21.712500 1352.500000 1.110257e+03 6099.00000 1.435250
max 200346.000000 14667.000000 1.094755e+07 15896.00000 229.908000
koi_slogg koi_kepmag
count 9564.000000 9564.000000
mean 4.310157 14.264606
std 0.424316 1.385376
min 0.047000 6.966000
25% 4.232750 13.440000
50% 4.432000 14.520000
75% 4.539000 15.322000
max 5.364000 20.003000

The scales in this dataset are absolutely massive. For koi_period, the 75th percentile is basically 40 days, but the maximum is 129,995 days, wild! Same goes for koi_prad, the radius: the 75th percentile is just about 21 Earth radii, but the maximum is 200,346 Earth radii, which is almost certainly erroneous. We need to transform or cap these values, using a scaler that handles outliers well.
Then, after dropping the unneeded columns and the CANDIDATE category, we can plot a histogram of the dataset.
There are a lot of things that can be observed from these plots.
First of all, no feature is neatly centered, i.e. perfect for training a model on. Almost all distributions are right-skewed: the bulk of the points sit at small values on the left, with a long tail of large values stretching right. koi_kepmag (Kepler-band magnitude) is roughly symmetrically distributed around its center, but the rest (koi_prad, koi_insol, koi_srad and koi_period) are not. Nearly all (~99%) of the data points are small in these columns, and only a few are large, or very large. These are heavy-tailed distributions. We cannot use a simple StandardScaler on them, because the extreme values would dominate the mean and variance and ruin the scaling for everything else. These columns need a mathematical transformation before scaling.
The class 0 data points are almost double the class 1 points in koi_disposition (the target feature): roughly ~5000 vs ~2500. This is called Class Imbalance. Simply using train_test_split may leave the classes in different proportions across the splits, so we use StratifiedShuffleSplit to preserve the same ~2:1 ratio in both training and test sets. The danger is that a dumb model always predicting class 0 would be correct ~64% of the time (accuracy) without being useful at all. That's why we will look at the F1 Score and the Confusion Matrix instead.
Other features like koi_steff, koi_kepmag and koi_teq are much closer to standard bell curves, slightly skewed in one direction, but fine: StandardScaler will handle them.
The most common fix for this kind of heavy tail in data science is a logarithmic transform (np.log1p()). It shrinks massive outliers while spreading out small, tightly clustered values.

skewed_cols = ['koi_period', 'koi_prad', 'koi_insol', 'koi_srad']
for col in skewed_cols:
    df[col] = np.log1p(df[col])

The class imbalance is easily handled by maintaining the class ratio in the splits as well, using the basic StratifiedShuffleSplit.
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=38)
for train_index, test_index in split.split(df, df['koi_disposition']):
    df_train = df.iloc[train_index]
    df_test = df.iloc[test_index]

print(f"Original Ratio: {df['koi_disposition'].value_counts() / len(df)}")
print(f"Train Ratio: {df_train['koi_disposition'].value_counts() / len(df_train)}")
print(f"Test Ratio: {df_test['koi_disposition'].value_counts() / len(df_test)}")

Original Ratio: koi_disposition
0 0.638222
1 0.361778
Name: count, dtype: float64
Train Ratio: koi_disposition
0 0.638252
1 0.361748
Name: count, dtype: float64
Test Ratio: koi_disposition
0 0.638102
1 0.361898
Name: count, dtype: float64

We should perform the train/test split as early as possible, and here we do it right after our EDA. The goal is to hide the test set from the model as well as from our own eyes, so that we do not introduce any bias toward the test data.
After both, we have this histogram and correlation analysis.
koi_disposition 1.000000
koi_slogg 0.180683
koi_kepmag 0.057630
koi_period 0.057038
koi_srad -0.198354
koi_steff -0.220260
koi_insol -0.264089
koi_teq -0.272284
koi_prad -0.434116
Name: koi_disposition, dtype: float64

The correlations tell us that koi_prad is the most informative feature: it is strongly negatively correlated with an object being a real planet. The rest are moderately informative, while koi_kepmag and koi_period have almost no linear relationship with the target. koi_slogg has a slight positive correlation.
Looks pretty nice, doesn't it?
The very first thing we need to do is separate our predictors (features) from the labels.

kepler_train = df_train.drop("koi_disposition", axis=1)
kepler_labels = df_train["koi_disposition"].copy()

An imputer fills in missing values of a dataset, for example with the median of each column; sklearn's SimpleImputer handles this.
Proper scaling is another important step. koi_steff goes up to 15896, but koi_slogg maxes out at just 5.36. Many ML algorithms implicitly treat larger numbers as more important, so we use something like a StandardScaler, which shifts each feature to mean 0 and standard deviation 1.
Now, doing these tasks one by one in a single notebook or Python script is fine once, but painful to repeat: we end up copy-pasting bits around. This is where sklearn's pipelines come into play. They apply the predefined transformation steps with a single .fit_transform(...) call.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# This pipeline executes our cleaning steps in order
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),  # Fill missing values
    ('std_scaler', StandardScaler()),               # Scale the numbers
])

kepler_prepared = num_pipeline.fit_transform(kepler_train)
print("Data shape after preparation:", kepler_prepared.shape)
print("First row of prepared data:\n", kepler_prepared[0])

After applying the transformation, we get back a NumPy ndarray: superb for speedy calculations and model training, but incompatible with human eyes. To see a histogram of the scaled data, we need to convert it back to a DataFrame.
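A minimal sketch of that conversion, with toy data standing in for kepler_train so the snippet runs standalone:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for kepler_train; the real frame has all the koi_* columns.
kepler_train = pd.DataFrame({
    "koi_period": [9.5, 54.4, 19.9, 1.7],
    "koi_prad": [2.26, 2.83, 14.6, np.nan],  # one gap for the imputer to fill
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
kepler_prepared = num_pipeline.fit_transform(kepler_train)

# Wrap the ndarray back into a DataFrame, reusing the original column names
# and index, so .hist(), .describe() etc. work again.
prepared_df = pd.DataFrame(
    kepler_prepared, columns=kepler_train.columns, index=kepler_train.index
)
print(prepared_df.mean().round(6))  # every column centred at ~0
```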
Mostly looks the same, but data is rescaled now. The mean is 0, and standard deviation is 1. Now we can move onto the training part
Instead of starting with a very strong model and calling it a day, we will begin with a baseline: an SGD (Stochastic Gradient Descent) classifier, rather than something powerful like a Random Forest.
from sklearn.linear_model import SGDClassifier

# X_train / y_train are the prepared training features and labels from above.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)

With this much class imbalance, accuracy would be a terrible evaluation metric: the model might learn to always predict "false positive" and still be correct roughly 64% of the time. So we will use Cross-Validation: split the training set into 3 chunks, train on 2, and predict on the 1 it hasn't seen yet.
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, kepler_prepared, kepler_labels, cv=3)

And then we analyze the core metrics.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

print("Confusion Matrix:\n", confusion_matrix(y_train, y_train_pred))
print(f"Precision: {precision_score(y_train, y_train_pred) * 100:.2f}%")
print(f"Recall: {recall_score(y_train, y_train_pred) * 100:.2f}%")
print(f"F1 Score: {f1_score(y_train, y_train_pred) * 100:.2f}%")

Which gives, for a random_state of 32198749...
Confusion Matrix:
[[3065 806]
[ 404 1790]]
Precision: 68.95%
Recall: 81.59%
F1 Score: 74.74%

Breaking down the confusion matrix:
- True Negatives: 3065 -> correctly ignored 3065 false positives.
- False Positives: 806 -> incorrectly flagged 806 non-planets as confirmed. This is why precision is 68.95%: roughly 3 in 10 of the model's "confirmed" predictions are wrong.
- False Negatives: 404 -> incorrectly rejected 404 real planets. That is why recall is 81.59%: the model found about 81% of all real planets.
- True Positives: 1790 -> correctly confirmed 1790 real planets.
Since we cannot have both high precision and high recall, we must choose what matters most for this problem. Follow-up observations, say with the James Webb Space Telescope, get very expensive, so we cannot waste money pointing at 806 fake planets. Therefore, for Kepler-Zero, we want High Precision.
Now we can try to further improve the model by tweaking the Decision Threshold of the SGD classifier.
Under the hood, the model computes a score for each planet: if it is > 0, the prediction is Confirmed (1); if < 0, it is rejected (0). We can shift this threshold to better fit our project's goals.
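The threshold shifting can be sketched like this. Toy data stands in for the real training set here (the real code would use kepler_prepared and kepler_labels), so the exact numbers will differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# Toy imbalanced dataset (~64% negatives) standing in for the Kepler features.
X, y = make_classification(n_samples=2000, weights=[0.64], random_state=42)

# Ask for raw decision scores instead of hard 0/1 predictions.
y_scores = cross_val_predict(SGDClassifier(random_state=42), X, y,
                             cv=3, method="decision_function")

# Raising the cut-off above the default 0 trades recall for precision.
for threshold in (0.0, 0.5, 1.0):
    y_pred = y_scores > threshold
    print(f"t={threshold}: precision={precision_score(y, y_pred):.2%} "
          f"recall={recall_score(y, y_pred):.2%}")
```

In general, the higher the threshold, the fewer but more confident the "confirmed" calls become.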
Here, we found a threshold where precision rises above 80%, but the price we paid was recall dropping to 4%, which is pretty bad. This is unfortunately the limit of linear models: the SGD classifier draws a straight line (a hyperplane) to separate the classes, which is not ideal for our problem. So we try a non-linear model: Random Forest, an ensemble of many Decision Trees that can capture complex patterns in the data, which is not possible with a linear model.
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)
forest_clf.fit(X_train, y_train)

And then the evaluation part.
# Cross validated prediction
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3, method="predict_proba")
# Random Forest uses probabilities, take the score of the positive class (1)
y_scores_forest = y_probas_forest[:, 1]
y_train_pred_forest = y_scores_forest >= 0.5
print("--- Random Forest Performance ---")
print("Confusion Matrix:\n", confusion_matrix(y_train, y_train_pred_forest))
print(f"Precision: {precision_score(y_train, y_train_pred_forest) * 100:.2f}%")
print(f"Recall: {recall_score(y_train, y_train_pred_forest) * 100:.2f}%")
print(f"F1 Score: {f1_score(y_train, y_train_pred_forest) * 100:.2f}%")

Which is far better than the basic SGDClassifier:
--- Random Forest Performance ---
Confusion Matrix:
[[3414 457]
[ 359 1835]]
Precision: 80.06%
Recall: 83.64%
F1 Score: 81.81%

The most important features (top 3):
Top 3 Features for Kepler-Zero:
koi_prad 0.327731
koi_period 0.138250
koi_insol 0.131683
dtype: float64

Precision/Recall vs. Threshold Curve
On the curve there is a mark at the 90% precision level, which is our target. The threshold there is 0.77, giving a precision of 90.43% and a recall of 56.01%, which is acceptable. The F1 score is 69.18%.
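The search for that 90%-precision threshold can be sketched with precision_recall_curve. Again, toy data stands in for kepler_prepared / kepler_labels, so the resulting threshold will not match the 0.77 found above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

# Toy imbalanced dataset standing in for the prepared Kepler training set.
X, y = make_classification(n_samples=2000, weights=[0.64], random_state=42)

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
y_probas = cross_val_predict(forest_clf, X, y, cv=3, method="predict_proba")
y_scores = y_probas[:, 1]  # probability of the positive class

precisions, recalls, thresholds = precision_recall_curve(y, y_scores)

target_precision = 0.90
# Index of the first threshold whose precision clears the target.
idx = np.argmax(precisions >= target_precision)
idx = min(idx, len(thresholds) - 1)  # the last precision point has no threshold
threshold_90 = thresholds[idx]
print(f"threshold={threshold_90:.2f} "
      f"precision={precisions[idx]:.2%} recall={recalls[idx]:.2%}")
```

The chosen threshold_90 is what gets saved alongside the model so the API can apply the same cut-off at inference time.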
Around 20% of the data was reserved at the start for testing. After building the pipeline and tuning the Random Forest on the training data, a single final evaluation was performed on this 'unseen' set. The model achieved 91.71% Precision, proving that it is ready for deployment in the Kepler-Zero API.
The final model metrics on the test set (which it has never seen before):
--- Test set performance metrics ---
Test Precision: 91.71%
Test Recall: 62.48%
Test F1 Score: 74.32%

Which is not bad! It's great. It means we can proudly say: "When the system flags a planet as confirmed, there is a 91.71% chance that it is correct." The trade-off is that the model only recovers about 62.48% of the real planets, but at least we are not wasting resources on things that are not planets.
An overall F1 score of 74.32% is great!
Now we need to save the model and the data pipeline, and load them back on the FastAPI server to serve requests. We can use joblib for this task.
We serialize everything needed for inference into a single file:
- model
- preprocessing pipeline
- feature metadata
- threshold + metrics
import joblib

joblib.dump(
    {
        "skewed_cols": skewed_cols,
        "pipeline": num_pipeline,
        "model": forest_clf,
        "precision_target": target_precision,
        "precision_threshold": threshold_90_rf,
        "precision_score": precision_score(y_train, y_train_pred_90_forest),
        "recall_score": recall_score(y_train, y_train_pred_90_forest),
        "f1_score": f1_score(y_train, y_train_pred_90_forest),
    },
    "./api/model/kepler_zero_format.joblib",
)

The API is just a thin wrapper around the saved model.
- load .joblib once
- keep model + pipeline in memory
- validate input (Pydantic)
- apply same preprocessing as training
- run pipeline → model
- apply threshold
- return result
- basic sanity check + model metadata
Request Flow
request → validate → preprocess → pipeline → model → threshold → response
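The load → preprocess → threshold path above can be sketched without FastAPI. This is a toy bundle built inline so the snippet is self-contained; the key names mirror the joblib dictionary saved earlier, and `predict_confirmed` is a hypothetical helper, not the actual API code:

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# --- training side: build and save a tiny stand-in bundle ---
X = pd.DataFrame({"koi_period": [1.0, 2.0, 3.0, 4.0],
                  "koi_prad": [1.1, 2.2, 15.0, 30.0]})
y = [1, 1, 0, 0]
X_log = np.log1p(X)  # training applied log1p to the skewed columns
pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                 ("scaler", StandardScaler())])
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(pipe.fit_transform(X_log), y)
joblib.dump({"skewed_cols": ["koi_period", "koi_prad"], "pipeline": pipe,
             "model": model, "precision_threshold": 0.77}, "bundle.joblib")

# --- inference side: load once, keep in memory, apply same preprocessing ---
bundle = joblib.load("bundle.joblib")

def predict_confirmed(raw: pd.DataFrame) -> np.ndarray:
    df = raw.copy()
    # Same log1p transform the training data received.
    df[bundle["skewed_cols"]] = np.log1p(df[bundle["skewed_cols"]])
    proba = bundle["model"].predict_proba(bundle["pipeline"].transform(df))[:, 1]
    # Apply the saved high-precision threshold instead of the default 0.5.
    return proba >= bundle["precision_threshold"]

print(predict_confirmed(X))
```

The FastAPI handler would wrap `predict_confirmed` with Pydantic validation and return the boolean flag plus the stored metadata.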
- Do feature engineering
- planet_star_ratio = koi_prad / koi_srad
- temp_insol_ratio = koi_teq / (koi_insol + 1)
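The proposed engineered features could be sketched as follows. `add_engineered_features` is a hypothetical helper, not yet part of the pipeline:

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Planet-to-star size ratio: a planet that is huge relative to its star
    # is a classic false-positive signature (e.g. an eclipsing binary).
    out["planet_star_ratio"] = out["koi_prad"] / out["koi_srad"]
    # Equilibrium temperature per unit insolation; the +1 guards against
    # division by zero (koi_insol has a minimum of 0 in the data).
    out["temp_insol_ratio"] = out["koi_teq"] / (out["koi_insol"] + 1)
    return out

df = pd.DataFrame({"koi_prad": [2.26, 14.6], "koi_srad": [0.927, 0.868],
                   "koi_teq": [793.0, 638.0], "koi_insol": [93.59, 39.3]})
print(add_engineered_features(df)[["planet_star_ratio", "temp_insol_ratio"]])
```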