Datasets available at hackathon.lu
At hackathon.lu 2025, we believe in the power of open data and large datasets to drive innovation, foster creativity, and enable groundbreaking solutions in open-source security tooling. To support participants in their projects, we will provide access to a curated selection of public and semi-public datasets spanning various domains. These datasets are contributed by different organizations, and we welcome additional contributions to further enrich the resources available during the hackathon.
Datasets

Common Crawl (Provider: Common Crawl)
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
During the hackathon, the CC-MAIN-2024-51 crawl dataset (over 80 TB) will be available locally, including its indexes.
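As an illustration of how the indexes can be used to pull a single page out of the crawl, here is a minimal sketch. It assumes the local copy keeps Common Crawl's usual CDXJ index line layout (`<SURT key> <timestamp> <JSON>`) and that WARC files sit under a local root directory at the relative path given in the index's `filename` field. The root directory, the `WarcPointer` type, and the function names are hypothetical and not part of the hackathon setup.

```python
import gzip
import json
import os
from typing import NamedTuple


class WarcPointer(NamedTuple):
    """Where one captured page lives inside a WARC file (hypothetical helper type)."""
    url: str
    filename: str
    offset: int
    length: int


def parse_cdx_line(line: str) -> WarcPointer:
    """Parse one CDXJ index line: '<SURT key> <timestamp> <JSON payload>'."""
    _urlkey, _timestamp, payload = line.rstrip("\n").split(" ", 2)
    fields = json.loads(payload)
    # The index stores offset and length as strings, so convert them.
    return WarcPointer(
        url=fields["url"],
        filename=fields["filename"],
        offset=int(fields["offset"]),
        length=int(fields["length"]),
    )


def read_record(root: str, ptr: WarcPointer) -> bytes:
    """Read and decompress one WARC record from a local copy of the crawl.

    Assumption: `root` is the directory under which the crawl's relative
    paths (the `filename` field) are stored.
    """
    with open(os.path.join(root, ptr.filename), "rb") as f:
        f.seek(ptr.offset)
        raw = f.read(ptr.length)
    # Each record is its own gzip member, so it decompresses in isolation.
    return gzip.decompress(raw)
```

Because each record is compressed independently, only `length` bytes need to be read per page, so a lookup does not require scanning a whole WARC file.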
Collected X.509 Certificates (Provider: CIRCL)
CIRCL collects X.509 certificates using various scanning techniques.
During the hackathon, a dataset in Parquet format containing over 100 million X.509 certificates will be available.
Collected SSH Key Materials and Metadata (Provider: CIRCL)
CIRCL collects SSH key materials and metadata using various scanning techniques.
During the hackathon, a dataset in Parquet format will be available, containing SSH key materials and associated metadata.
GreyNoise HTTP Traffic - Log4Shell (Provider: GreyNoise)
A dataset in Parquet format containing HTTP traffic related to Log4Shell, collected by GreyNoise between October 2021 and April 2022, will be available.
Vulnerabilities (Provider: CIRCL)
Daily dumps of security advisories from over 20 sources (CVE Program, NVD, PySec, GHSA, various CSAF feeds, etc.) will be available.
Additionally, AI datasets and models will be available.