Data taking

Data processing follows a tier-based approach, where initial filtering for particle interaction-related photon patterns (triggering of photon “hits”) serves to create data at a first, event-based data level. In a second step, processing of the events by applying calibration, particle reconstruction and data analysis methods leads to enhanced data sets, requiring a high-performance computing infrastructure for flexible application of modern data processing and data mining techniques.

For physics analyses, derivatives of these enriched data sets are generated and their information is reduced to low-volume high-level data which can be analysed and integrated locally into the analysis workflow of the scientist. For interpretability of the data, a full Monte Carlo simulation of the data generation and processing chain, starting at the primary data level, is run to generate reference simulated data for cross-checks at all processing stages and for statistical interpretation of the particle measurements.

Overview of data levels

Event data processing

Photon-related information is written to ROOT-based tree-like data structures and accumulated during a predefined data taking time range of usually several hours (so-called data runs) before being transferred to high-performance computing (HPC) clusters.

Processed event data sets at the second level represent input to physics analyses, e.g. regarding neutrino oscillation and particle properties, and studies of atmospheric and cosmic neutrino generation. Enriching the data to this end involves probabilistic interpretation of temporal and spatial photon distributions for the reconstruction of event properties in both measured and simulated data, and requires high-performance computing capabilities.

Access to data at this level is restricted to collaboration members due to the intense use of computing resources, the large volume and complexity of the data and the members' primary exploitation right of KM3NeT data. However, data at this stage is already converted to HDF5 format as a less customized hierarchical format. This format choice increases interoperability and facilitates the application of data analysis software packages used e.g. in machine learning and helps to pave the way to wider collaborations within the scientific community utilizing KM3NeT data.

High level data and data derivatives

Summary formats and high-level data

As mostly information on particle type, properties and direction is relevant for the majority of physics analyses, a high-level summary format has been designed to reduce the complex event information to simplified arrays which allow for easy representation of an event data set as a table-like data structure. Although this already leads to a reduced data volume, these neutrino data sets are still dominated by atmospheric muon events at a ratio of about $10^{6} :1$. Since, for many analyses, atmospheric muons are considered background events to both astrophysics and oscillation studies, publication of low-volume general-purpose neutrino data sets requires further event filtering. Here, the choice of optimal filter criteria is usually dependent on the properties of the expected flux of the signal neutrinos and performed using the simulated event sets.
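The filtering step described above can be sketched in a few lines. This is an illustrative example only: the cut variables `nu_score` and `fit_quality` and their thresholds are invented for the sketch and do not correspond to actual KM3NeT selection criteria, which depend on the expected signal flux and are tuned on simulated event sets.

```python
# Hypothetical sketch of a high-level event filter: suppress the dominant
# atmospheric-muon background by cutting on an assumed per-event neutrino
# classification score and reconstruction quality (names are illustrative).

def select_neutrino_candidates(events, score_cut=0.9, quality_cut=0.5):
    """Keep events whose classifier score and fit quality pass the cuts."""
    return [
        ev for ev in events
        if ev["nu_score"] > score_cut and ev["fit_quality"] > quality_cut
    ]

events = [
    {"id": 1, "nu_score": 0.97, "fit_quality": 0.8},  # likely neutrino
    {"id": 2, "nu_score": 0.12, "fit_quality": 0.9},  # likely atmospheric muon
    {"id": 3, "nu_score": 0.95, "fit_quality": 0.3},  # poorly reconstructed
]

selected = select_neutrino_candidates(events)
```

In practice the score and quality thresholds would be optimised per analysis against the simulated reference samples.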

Open data sets and formats

As all of the following data is published, inter alia, via the Open Data Center (ODC), the data sets are enriched with metadata following the KM3OpenResource description.

Particle event tables

Data generation

For particle event publication, the full information for each reconstructed event in the data level 2 files is reduced to a “one row per event” format by selecting the relevant parameters. The event and parameter selection, metadata annotation and conversion of parameters to the intended output format are performed using the km3pipe software. The prototype provenance recording has also been included in this software, so that the output of the pipeline already includes the relevant metadata as well as provenance information. The software allows writing the data to several formats, including text-based formats and HDF5, which are the two relevant formats used in this demonstrator.
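The actual reduction is implemented with km3pipe; the following stdlib-only sketch only illustrates the “one row per event” idea for the text-based output, with the flat table and its metadata side file written separately. All parameter names and provenance step identifiers here are invented for the example.

```python
# Illustrative "one row per event" reduction: select a subset of parameters
# per event into a flat csv table, and write parameter descriptions plus
# provenance steps to a separate metadata file (cf. the csv output format).
import csv
import io
import json

def write_event_table(events, parameters, csv_buf, meta_buf):
    """Reduce rich event records to a flat table plus a metadata side file."""
    writer = csv.DictWriter(csv_buf, fieldnames=parameters, extrasaction="ignore")
    writer.writeheader()
    for ev in events:
        writer.writerow(ev)  # fields not listed in `parameters` are dropped
    meta = {
        "parameters": [{"name": p, "unit": "SI", "description": ""} for p in parameters],
        "provenance": ["level2-reconstruction", "event-selection"],  # assumed step ids
    }
    json.dump(meta, meta_buf, indent=2)

events = [
    # rich level-2-like record; `hit_times` is dropped in the reduction
    {"event_id": 0, "dir_zenith": 1.57, "energy": 42.0, "hit_times": [1.0, 2.0]},
]
csv_buf, meta_buf = io.StringIO(), io.StringIO()
write_event_table(events, ["event_id", "dir_zenith", "energy"], csv_buf, meta_buf)
```

km3pipe additionally handles the HDF5 output and the automatic provenance recording, which this sketch does not attempt to reproduce.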

Data description

Scientific use

Particle event samples can be used in both astrophysics analyses and neutrino oscillation studies, see the KM3NeT science targets. Therefore, the data must be made available in a format suitable for the Virtual Observatory as well as for particle physics studies.


The events, from which relevant parameters like particle direction, time, energy and classification parameters are selected for generation of the event table, are enriched with the following metadata.

Metadata type            Content
Provenance information   list of processing steps (referenced by identifier)
Parameter description    parameter name, unit (SI), type, description, identifier
Data taking metadata     start/stop time, detector, event selection info
Publication metadata     publisher, owner, creation date, version, description

Technical specification

Data structure

The general data structure is an event list which can be displayed as a flat table with parameters for one event filling one row. Each event row contains an event identifier.

File format

For the tabled event data, various output formats are used depending on the platform used for publication and the requirements for interoperability. The formats defined here at the moment are not exclusive and might be extended according to specific requests from the research community in the future.

For hdf5 files as output, various options exist to store metadata, as several tables can be written to the same file and each table and the file itself can hold additional information as attributes. Therefore, metadata that should be easy for the user to find and read has been stored in a separate “header” table, while metadata that is more relevant for the machine-based interpretation of the data has been stored as attributes.
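Under these conventions, a published hdf5 event file could be laid out roughly as follows (a sketch; only the “header” table name is taken from the text, the other group names are assumed):

```
events.h5
├── /header            table: human-readable metadata
│                      (provenance, parameter descriptions,
│                       data taking and publication information)
├── /events            table: one row per reconstructed event,
│                      attrs: machine-readable metadata
│                      (units, types, parameter identifiers)
└── file-level attrs:  e.g. format version, creation date
```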

In the case of a text-based table, csv files are generated that are accompanied by a metadata file.

Output format   Provenance      Parameters      Data taking     Publication
hdf5            file header     table header    table header    “header” table
csv             metadata file   metadata file   metadata file   metadata file

VO server If the neutrino data set is relevant for astrophysics analyses, a text file is generated and the metadata is mapped to the resource description format required by the DaCHS software, with the simple cone search (SCS) protocol applied to it. In the ODC, the event sample is recorded as KM3OpenResource pointing to the service endpoints of the VO server. Thus, the data set is findable both through the VO registry and the ODC and accessible through VO-offered access protocols.
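A client would query such a service with the standard SCS parameters. In the sketch below, only the RA/DEC/SR query parameters are fixed by the SCS protocol; the base URL is a placeholder, not the actual KM3NeT service endpoint.

```python
# Build a simple cone search (SCS) query URL: sky position and search
# radius in degrees. The service returns a VOTable of matching events.
from urllib.parse import urlencode

def cone_search_url(base_url, ra_deg, dec_deg, radius_deg):
    """Compose the SCS query string defined by the protocol."""
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    return f"{base_url}?{query}"

# Placeholder endpoint; coordinates roughly point at the Crab nebula.
url = cone_search_url("https://vo.example.org/km3net/scs", 83.63, 22.01, 1.0)
```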

KM3NeT Open Data Server In the current test setup, event files that are not easily interpretable in an astrophysics context, like the test sample from the ORCA detector containing mostly atmospheric muons, are stored on the server and registered as KM3OpenResource. While this practice is acceptable now for the relatively small datasets, the design of the server also allows pointing to external data sources in the future and interfacing with storage locations of extended data samples.

Multimessenger alerts

Data generation

Data generation and scientific use have been described in the Multimessenger section. The output of the online reconstruction chain is an array of parameters for the identified event as a JSON key/value dictionary, which is then annotated with the relevant metadata to match the VOEvent specifications.
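The mapping from the reconstruction output to the VOEvent `<What>` section can be sketched with the standard library; the `<Param name=... value=...>` element shape follows the VOEvent convention, while the event identifier and values below are invented for the example.

```python
# Wrap a flat JSON-style event dict into a VOEvent <What> section,
# one <Param> element per key/value pair.
import xml.etree.ElementTree as ET

def to_voevent_what(event):
    """Map a flat key/value event dict onto a VOEvent <What> section."""
    what = ET.Element("What")
    for name, value in event.items():
        ET.SubElement(what, "Param", name=name, value=str(value))
    return what

# Illustrative values only; real alerts carry the full property set
# listed in the table below (flavor, error boxes, probabilities, ...).
event = {"event_id": "KM3-230101A", "energy_GeV": 118.0}
xml_text = ET.tostring(to_voevent_what(event), encoding="unicode")
```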

Data description

The event information can, depending on its specific use, be divided into the following data or metadata categories.

(Meta)data type        Content
Event identification   event identifier, detector
Event description      type of triggers, IsRealAlert
Event coordinates      time, right ascension, declination, longitude, latitude
Event properties       flavor, multiplicity, energy, neutrino type, error box 50%, 90% (TOC), reconstruction quality, probability to be neutrino, probability for astrophysical origin, ranking
Publication metadata   publisher, contact

Technical specification

Data structure & format

The VOEvent is stored as an XML file which contains the central sections WhereWhen, Who, What, How and Why.

VOEvent specifications
Section       Description              (Meta)data
<Who>         Publication metadata     including the VOEvent stream identifier
<WhereWhen>   Space-time coordinates   event coordinates offered in UTC (time) and FK5 (equatorial coordinates), detector location
<What>        Additional parameters    event properties, event identifier
<How>         Additional information   description of the alert type
<Why>         Scientific context       details on the alert procedure
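Putting the sections together, a KM3NeT alert VOEvent has the following skeleton; the stream identifier and the role attribute shown here are placeholders, only the section structure is fixed by the VOEvent specification:

```xml
<!-- Skeleton only: ivorn and content are placeholders, not a real alert. -->
<voe:VOEvent xmlns:voe="http://www.ivoa.net/xml/VOEvent/v2.0"
             ivorn="ivo://km3net.example/alerts#candidate_0001"
             role="test" version="2.0">
  <Who>       <!-- publisher, contact, VOEvent stream identifier -->      </Who>
  <What>      <!-- event identifier and properties as <Param> elements --> </What>
  <WhereWhen> <!-- UTC time, FK5 coordinates, detector location -->        </WhereWhen>
  <How>       <!-- description of the alert type -->                       </How>
  <Why>       <!-- details on the alert procedure -->                      </Why>
</voe:VOEvent>
```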

Alerts are received and sent via the GCN. The alert data are the neutrino candidates in VOEvent format, the standard data format for experiments to report and communicate their observed transient celestial events, facilitating follow-up observations. The alert distribution is done via Comet, an implementation of the VOEvent transport protocol.

Beyond this, other receivers can also be implemented but are less convenient, e.g. the TNS for optical alerts, the ZTF/LSST brokers for optical transients, or the Fermi flare advocate for Fermi blazar outbursts.

For public alerts, KM3NeT will also submit notices and circulars (with a human in the loop) for dissemination.

Supplementary services and data derivatives

Data generation

Providing context information on a broader scale in the form of e.g. sensitivity services and instrument response functions alongside the VO-published data sets is still under investigation and highly dependent on the specific information. Therefore, additional metadata for the interpretation of the format is required.

Data description

Scientific use

Models and theoretical background information used in the analysis are provided, e.g. as accompanying data sets (as for the ANTARES example dataset), to statistically interpret the data sets. Alternatively, probability functions for theoretical predictions, drawn from simulations, are considered for publication, including e.g. instrument response functions.


Metadata here must be case specific:

  • Description of the structure of the data (e.g. binned data, formula), which will be indicated by a content descriptor ktype and accompanied by type-specific additional metadata
  • Description of the basic data set from which the information is derived, its scope in time and relevant restraints to the basic domain, e.g. description of the simulation sample
  • Description of all relevant parameters
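A case-specific metadata record following these three points could look as follows; only the ktype content descriptor is taken from the text, the ktype value and all other field names are invented for the illustration:

```json
{
  "ktype": "histogram2d",
  "dataset": {
    "description": "derived from the full-detector simulation sample",
    "valid_from": "2020-01-01",
    "valid_to": "2020-12-31"
  },
  "parameters": [
    {"name": "energy", "unit": "GeV", "axis": "x"},
    {"name": "effective_area", "unit": "m2", "axis": "value"}
  ]
}
```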

Technical specification

Data structure & format

The data is provided as a csv table or json file with the relevant metadata provided alongside the data in a separate text file or in a header section.


Interpretation of the plot or service data is provided using the openkm3 package, which loads the data as KM3OpenResource from the ODC and interprets it according to the ktype. The relevant data can then be accessed either as an array or, where applicable, directly rendered to a plot using matplotlib, which can then be edited further.

Acoustic hydrophone data

Data generation

Acoustic data acquisition as described in the sea science section offers a continuous stream of digitized acoustic data that undergoes a filtering process according to the scientific target of the audio data. At this point, the raw acoustic data before filtering can be offered as example data and to researchers interested in sea science. Snippets of acoustic data with a duration of a few minutes are produced at a fixed interval and, after format conversion, directly offered from a data server integrated in the acoustic data acquisition system and made accessible through a REST API. Integrating this data stream in the open science system therefore offers a good example of using a data stream offered externally to the ODC with a growing number of individual data sets.

Data description

Scientific use The hydrophone data can be used, after triggering and filtering, for acoustic neutrino detection, detector positioning calibration and identification of marine acoustic signals, e.g. originating from whales. In the unfiltered form, the acoustic data might primarily be of interest for sea science.


  • Publication metadata is added during record creation at the ODC
  • Instrumentation & data taking settings are offered for each data package through a separate endpoint (/info) of the REST API.

Technical specification

Data structure & format

Each data package contains the same audio data, recorded in a custom binary format (raw) and converted to wave and mp3 audio files. Additionally, statistical properties of the audio snippet are offered in a separate stream.

Format   Endpoint   Description                                     Return format
raw      /raw       custom binary format                            application/km3net-acoustic
mp3      /mp3       mpeg encoded data                               audio/mpeg
wave     /wav       wave format data                                application/octet-stream
psd      /psd       array with mean, median, 75% and 95% quantile   application/json
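A minimal client needs only to compose the endpoint URLs per data package. In this sketch the base URL and the package identifier scheme are placeholders; the endpoint names follow the format table above, plus the /info endpoint for the instrumentation and data taking settings.

```python
# Hypothetical URL builder for the acoustic data REST API; endpoint
# paths are taken from the published format table, the base URL and
# package identifiers are assumed for the example.
ENDPOINTS = {"raw": "/raw", "mp3": "/mp3", "wave": "/wav", "psd": "/psd", "info": "/info"}

def acoustic_url(base_url, package_id, fmt):
    """Build the download URL for one audio package in the requested format."""
    if fmt not in ENDPOINTS:
        raise ValueError(f"unknown format: {fmt}")
    return f"{base_url}/{package_id}{ENDPOINTS[fmt]}"

url = acoustic_url("https://acoustics.example.org/api", "pkg-001", "psd")
```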

For each file, a KM3OpenResource is registered in the ODC. All resources belonging to the same data type are grouped using the KM3ResourceStream metadata class, pointing to all resources of the data stream through the kid unique identifier. All streams belonging to the acoustic data service are grouped as a KM3ResourceCollection. Thus, each single resource can be addressed, while the logical connection between the resources is preserved.

The data is directly accessible through the ODC webpage views or using openkm3 as client from a python interface.