Skip to content

add summary feature nodedefinition#797

Merged
sevmag merged 16 commits intographnet-team:mainfrom
sevmag:summary_feature_node
Jun 6, 2025
Merged

add summary feature nodedefinition#797
sevmag merged 16 commits intographnet-team:mainfrom
sevmag:summary_feature_node

Conversation

@sevmag
Copy link
Copy Markdown
Collaborator

@sevmag sevmag commented Apr 30, 2025

Adding a new NodeDefiniton called DOMSummaryFeatures which is a per DOM summary statistics NodeDefiniton used by different architectures in IceCube. This exact version is taken from Theo Glauchs thesis (https://mediatum.ub.tum.de/node?id=1584755). The reason for this PR is because I heard of other people also using/interested in these type of features.

Features of the DOMSummaryFeatures

- total charge
- charge accumulated after times `charge_after_t`
- time of first hit per DOM
- time spread per DOM
- time std per DOM
- time took to collect the charge percentiles `time_after_charge_pct`
- number of pulses per DOM

Here is a visual representation of the feature (more details are in Theos thesis above)

summary_features

And the numerical ranges of these features:

summary_features_numerical_ranges

NOTE: I had to change the clustering utility implemented by @Aske-Rosted in #770 (Therefore the review request) .

  • I added functionality for sorting the clustered table by an additional feature
  • changed some of the functions (NOTE: this functions are not being used in default GN atm if I'm not mistaken)
    • add_charge_threshold_summary (changed it so it does not alter the _padded_x attribute anymore)
    • _calculate_charge_sum (change to deal with nans)
    • added the necessary functions for my features

I also did timing tests:

node_timing

And some ratios:

timing_ratios_nodes

Here the NodeDefinitons:

input_feature_names = ['dom_x', 'dom_y', 'dom_z', 'dom_time', 'charge']
nodes = {
    'nodes_as_pulses': NodesAsPulses(
        input_feature_names=input_feature_names,
    ),
    'prct_clust_5pcts' : PercentileClusters(
        input_feature_names=input_feature_names,
        cluster_on = ['dom_x', 'dom_y', 'dom_z'],
        percentiles=np.linspace(0.2, 1.0, 5),
    ),
    'prct_clust_10pcts' : PercentileClusters(
        input_feature_names=input_feature_names,
        cluster_on = ['dom_x', 'dom_y', 'dom_z'],
        percentiles=np.linspace(0.1, 1.0, 10),
    ),
    'prct_clust_20pcts' : PercentileClusters(
        input_feature_names=input_feature_names,
        cluster_on = ['dom_x', 'dom_y', 'dom_z'],
        percentiles=np.linspace(0.05, 1.0, 20),
    ),
    'ice_mix_max512': IceMixNodes(
        input_feature_names=input_feature_names,
        max_pulses=512,
        add_ice_properties=False,
        sample_pulses=True,
    ),
    'ice_mix_max1024': IceMixNodes(
        input_feature_names=input_feature_names,
        max_pulses=1024,
        add_ice_properties=False,
        sample_pulses=True,
    ),
    'ice_mix_max2048': IceMixNodes(
        input_feature_names=input_feature_names,
        max_pulses=2048,
        add_ice_properties=False,
        sample_pulses=True,
    ),
    'dom_summary': DOMSummaryFeatures(
        input_feature_names=input_feature_names,
        cluster_on = ['dom_x', 'dom_y', 'dom_z'],
        
    ),
}

@sevmag sevmag requested a review from Aske-Rosted April 30, 2025 15:43
time_standardization: float = 1e-3,
add_counts: bool = False,
) -> None:
"""Construct `PercentileClusters`.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference to wrong class in docstring

cluster_class.add_counts()
return torch.tensor(cluster_class.clustered_x)

def set_indeces(self, feature_names: List[str]) -> None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo - indices

return graph


class DOMSummaryFeatures(NodeDefinition):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you point out in the docstring, the method may aggregate information across any defined cluster. To avoid confusion, I think the name of the class should reflect this. It could for example be ClusterSummaryTheo :-)



class DOMSummaryFeatures(NodeDefinition):
"""Represent DOMs as clusters with summary features.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just for DOMs!

@sevmag sevmag force-pushed the summary_feature_node branch from ab66a04 to cfcbe8c Compare May 28, 2025 09:40
@sevmag
Copy link
Copy Markdown
Collaborator Author

sevmag commented May 28, 2025

@Aske-Rosted and I talked about some functionality and wanted to change a few small things.

  • Before ClusterSummaryFeatures enforced the sorting of the data per cluster by time. @Aske-Rosted requested to make this optional since sometimes the data is already sorted in the database, which then would create an unnecessary overhead when sorting again (New class init argument and got rid of assert in the cluster_and_pad class) I did not make any timing checks for that.
  • Possibility for logarithmic standardisation of the charge features (added functionality to handle both logarithmic and scalar multiplication standardisation)

See new feature ranges:

feature_ranges

Copy link
Copy Markdown
Collaborator

@Aske-Rosted Aske-Rosted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two very minor things that you might consider, but otherwise good to go. Looking forward to using it on my end.

)

def _calculate_reference_time(self, time_index: int) -> np.ndarray:
"""Calculate the charge weighted median time of the whole event."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably say of the node.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I do not understand what you mean by node, but I do not agree. The reference time is a weighted median of all pulses in the event, not just those at a single cluster/node. It is important to have a per-event reference point to have the relative time information between the clusters. Am I missing something?

or inserted at the specified location
"""
# Summarize the values at different times
time_first_pulse = np.nanmin(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be made to a class feature since it is used in several of the class functions.

Copy link
Copy Markdown
Collaborator

@Aske-Rosted Aske-Rosted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two very minor things that you might consider, but otherwise good to go. Looking forward to using it on my end.

@sevmag sevmag force-pushed the summary_feature_node branch from dbef537 to 27800c4 Compare June 1, 2025 20:17
@sevmag sevmag merged commit 9b1d49c into graphnet-team:main Jun 6, 2025
13 of 14 checks passed
@sevmag sevmag mentioned this pull request Nov 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants