Why are there so many noise points (outliers)? #11

Thank you for sharing this with the community. It is a great tool, and I would say it is UMAP+HDBSCAN on steroids!

Quick question, though: when I cluster 30k text embeddings, a lot of the texts end up grouped as outliers. I have tried changing parameters such as noise_level, base_min_cluster_size, and min_number_clusters, but about 10-15% of the population is still labeled as outliers. If I run UMAP+HDBSCAN manually, I get significantly fewer outliers.
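For reference, here is a minimal sketch of how the outlier share could be measured so the two runs can be compared on the same footing. It assumes noise points carry the label -1 (the HDBSCAN convention); the helper name `noise_fraction` and the sample label lists are hypothetical and only for illustration, not output from either tool.

```python
from collections import Counter


def noise_fraction(labels, noise_label=-1):
    """Return the share of points assigned to the noise label.

    Assumes noise is marked with -1, as HDBSCAN does; pass a different
    noise_label if the clustering output uses another convention.
    """
    labels = list(labels)
    if not labels:
        return 0.0
    counts = Counter(labels)
    return counts[noise_label] / len(labels)


# Hypothetical label lists standing in for two clustering runs.
labels_a = [0, 0, 1, -1, 1, -1, 2, 2, 2, 0]
labels_b = [0, 0, 1, 1, 1, -1, 2, 2, 2, 0]

print(f"run A noise share: {noise_fraction(labels_a):.1%}")
print(f"run B noise share: {noise_fraction(labels_b):.1%}")
```

Computing the share from the label arrays of both pipelines on the same 30k embeddings would make the difference easy to report alongside the parameter settings used.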
