Currently, we just use the whole dataset, which has two problems:
- It contains a bit too many samples
- It's heavily biased against scores that are less common but more significant, such as 90%, in favor of scores like 99.5%, which don't say much
To even out the training, we should try to sample uniformly across the sample space, which should give a more representative measure and training process.
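
One possible way to do this is to bucket samples by score and draw the same number from each bucket. The sketch below is only an illustration, not the project's actual code: the function name `sample_uniform_by_score`, the `(item, score)` tuple layout, the percent scale, and the 1-point bin width are all assumptions.

```python
import random
from collections import defaultdict


def sample_uniform_by_score(samples, per_bin, bin_width=1.0, seed=0):
    """Draw up to `per_bin` samples from each score bin.

    `samples` is assumed to be a list of (item, score) pairs, with
    score given in percent. Both the layout and the scale are
    hypothetical and should be adapted to the real dataset.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group samples into bins of width `bin_width` (e.g. 90.0-90.99 -> bin 90).
    bins = defaultdict(list)
    for item, score in samples:
        bins[int(score // bin_width)].append((item, score))

    # Take the same number from every bin, or the whole bin if it is smaller.
    picked = []
    for key in sorted(bins):
        bucket = bins[key]
        picked.extend(rng.sample(bucket, min(per_bin, len(bucket))))
    return picked
```

With this approach, a bin holding thousands of 99.5% samples contributes no more than a bin holding a handful of 90% samples. Bins with fewer than `per_bin` samples are still underrepresented, so `per_bin` and `bin_width` would need tuning against the real score distribution.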