Datasets available at hackathon.lu

At hackathon.lu 2025, we believe in the power of open data and large datasets to drive innovation, foster creativity, and enable groundbreaking solutions in open-source security tooling. To support participants in their projects and topics, we will provide access to a curated selection of public and semi-public datasets spanning various domains. These datasets are contributed by different organizations, and we welcome additional contributions to further enrich the resources available during the hackathon.

Datasets

Common Crawl (Provider: Common Crawl)

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

During the hackathon, the CC-MAIN-2024-51 crawl dataset (80TB+) will be available locally, including indexes.

Use of this data is subject to the Common Crawl Terms of Use.
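
As a starting point for working with the local copy, the following is a minimal sketch of how an index entry can be turned into a single WARC record. It assumes the local indexes keep the CDXJ line layout that Common Crawl publishes (a URL key, a timestamp, then a JSON object with "filename", "offset", and "length" fields) and that each record is stored as its own gzip member. The names WarcPointer, parse_cdxj_line, read_warc_record, and the root directory are illustrative and not part of any hackathon tooling.

```python
import gzip
import json
from pathlib import Path
from typing import NamedTuple


class WarcPointer(NamedTuple):
    """Location of one record inside a WARC file (hypothetical helper type)."""
    url: str
    filename: str
    offset: int
    length: int


def parse_cdxj_line(line: str) -> WarcPointer:
    """Parse one 'urlkey timestamp {json}' index line into a WARC pointer."""
    # The JSON payload may contain spaces, so split at most twice.
    _urlkey, _timestamp, payload = line.rstrip("\n").split(" ", 2)
    fields = json.loads(payload)
    # Offsets and lengths are stored as strings in the index.
    return WarcPointer(
        url=fields["url"],
        filename=fields["filename"],
        offset=int(fields["offset"]),
        length=int(fields["length"]),
    )


def read_warc_record(root: Path, ptr: WarcPointer) -> bytes:
    """Read and decompress a single record from a WARC file under root.

    Assumption: root is wherever the crawl is mounted locally, and
    ptr.filename is relative to it.
    """
    with open(root / ptr.filename, "rb") as fh:
        fh.seek(ptr.offset)
        compressed = fh.read(ptr.length)
    # Only the requested gzip member is decompressed, not the whole file.
    return gzip.decompress(compressed)
```

Reading by offset and length avoids decompressing entire WARC files, which matters at the 80TB+ scale of this crawl. The returned bytes contain the WARC headers followed by the captured HTTP response.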

Collected X.509 Certificates (Provider: CIRCL)

CIRCL collects X.509 certificates using various scanning techniques.

During the hackathon, a dataset in Parquet format containing over 100 million X.509 certificates will be available.

Collected SSH Key Materials and Metadata (Provider: CIRCL)

CIRCL collects SSH key materials and metadata using various scanning techniques.

During the hackathon, a dataset in Parquet format will be available, containing SSH key materials and associated metadata.

Greynoise HTTP Traffic - Log4Shell (Provider: Greynoise)

A dataset in Parquet format containing HTTP traffic related to Log4Shell, collected by Greynoise between October 2021 and April 2022, will be available.

Vulnerabilities (Provider: CIRCL)

Daily dumps of security advisories from over 20 sources (CVE Program, NVD, PySec, GHSA, various CSAF feeds, etc.) will be available.

Additionally, AI datasets and models will be available.