HnM Search Dataset Created from Recommendations Dataset
This synthetic dataset is built on top of a recommendations dataset, which serves as its base. In the base dataset, the transactions data records the articles purchased by users. This dataset adds the search queries that a user may have issued before buying an article, along with the candidate results for each query.
The license for our additions is https://cdla.dev/permissive-2-0/
Search Queries Dataset
queries.csv (253,685 rows): List of queries for transactions.
qrels.csv (253,685 rows): List of positive and negative article IDs that were retrieved for each query.
Base Dataset
articles.csv (105,542 rows): List of unique products/articles with their properties/features.
customers.csv (1,371,980 rows): List of unique customers/users with their properties/features.
transactions_train.csv (31,788,324 rows): List of historical transactions/purchases of different articles by customers.
Dataset Structure & Components
All search query data is located in the data/search/ directory.
data/search/queries.csv
Queries generated from individual transactions (transactions_train.csv).
(253685 rows, 3 columns: query_id, transaction_id, and query_text)
data/search/qrels.csv
Candidate results for each query: positives (from the transaction) and close negatives, both given as article_ids (from articles.csv).
(253685 rows, 3 columns: query_id, positive_ids, and negative_ids (space-separated))
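As a minimal sketch of how a qrels row could be parsed, the snippet below reads an in-memory stand-in for qrels.csv with pandas and splits the space-separated negatives into a list. The sample query and article IDs are made up for illustration, and the column name negative_ids is assumed to match the schema table in this card. Reading every column as a string is a design choice here, so that any leading zeros in IDs survive.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for data/search/qrels.csv.
sample_qrels = io.StringIO(
    "query_id,positive_ids,negative_ids\n"
    "q1,0108775015,0108775044 0110065001\n"
    "q2,0111565001,0111586001\n"
)

# dtype=str keeps IDs as text so leading zeros are not lost.
qrels = pd.read_csv(sample_qrels, dtype=str)

# Split the space-separated negatives into one Python list per query.
qrels["negative_list"] = qrels["negative_ids"].fillna("").str.split()

print(qrels[["query_id", "positive_ids", "negative_list"]])
```

For the real file, the `io.StringIO` object would be replaced by the path data/search/qrels.csv.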
All raw (recommendations) data is located in the data/raw/ directory.
data/raw/transactions_train.csv
A historical record of all purchase transactions. This file serves as a central table connecting customers with the articles they purchased.
(31,788,324 rows, 5 columns)
data/raw/customers.csv
This dimension table contains attributes for each unique customer.
(1,371,980 rows, 7 columns)
data/raw/articles.csv
This dimension table contains highly detailed attributes for each unique product (article).
(105,542 rows, 25 columns)
data/raw/images/
This directory contains product images, organized into subdirectories based on the first 3 digits of the article_id.
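The directory layout described above could be resolved with a small helper like the one below. This is only a sketch: the function name `image_path`, the 10-digit zero-padding of article_id, and the `.jpg` extension are assumptions that this card does not state, so they should be checked against the actual files.

```python
from pathlib import Path


def image_path(article_id, images_root="data/raw/images"):
    """Return the assumed image path for an article.

    Assumptions (not stated in this card): IDs are zero-padded to
    10 digits and images are stored as '<article_id>.jpg'.
    """
    padded = str(article_id).zfill(10)
    # The subdirectory is the first 3 digits of the padded article_id.
    return Path(images_root) / padded[:3] / f"{padded}.jpg"


print(image_path(108775015).as_posix())
# data/raw/images/010/0108775015.jpg
```

Padding matters when article_id has been loaded as an integer, since a leading zero would otherwise be dropped and the first three digits would point at the wrong subdirectory.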
Relationships Between Search Data
These files can be combined (joined) to create a comprehensive dataset for analysis:
query_id joins queries.csv with qrels.csv, giving each textual query together with its candidate result articles.
Similarly, transaction_id (from queries.csv) can be used to look up the corresponding transaction in transactions_train.csv.
positive_ids and negative_ids (from qrels.csv) can be joined with articles.csv to get the details of the result articles, both positives (which the user purchased) and negatives.
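The joins described above can be sketched with pandas as follows. The toy rows, IDs, query texts, and product names are hypothetical stand-ins for the real CSV contents, and only the column names listed in this card are used.

```python
import pandas as pd

# Hypothetical toy rows; real data would be read from the CSV files.
queries = pd.DataFrame({
    "query_id": ["q1", "q2"],
    "transaction_id": ["t1", "t2"],
    "query_text": ["black strap top", "blue slim jeans"],
})
qrels = pd.DataFrame({
    "query_id": ["q1", "q2"],
    "positive_ids": ["0108775015", "0111565001"],
    "negative_ids": ["0108775044 0110065001", "0111586001"],
})
articles = pd.DataFrame({
    "article_id": ["0108775015", "0111565001"],
    "prod_name": ["Strap top", "Slim jeans"],
})

# Join queries to their candidate results on query_id.
search = queries.merge(qrels, on="query_id", how="inner", validate="one_to_one")

# Attach the details of the purchased (positive) article.
search = search.merge(
    articles, left_on="positive_ids", right_on="article_id", how="left"
)

print(search[["query_text", "positive_ids", "prod_name"]])
```

The transaction_id join is left out of this sketch because this card does not state how transaction_id maps to a row of transactions_train.csv.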
Data Schema
The data schema for transactions_train.csv, customers.csv, and articles.csv can be obtained from https://huggingface.co/datasets/einrafh/hnm-fashion-recommendations-data.
Here is the schema for the search data.
`queries.csv`
| Column | Description | Type |
| --- | --- | --- |
| query_id | Unique ID for the query (Primary Key) | object (String) |
| transaction_id | Unique ID for the transaction (Foreign Key) | object (String) |
| query_text | Text of the query | object (String) |
`qrels.csv`
| Column | Description | Type |
| --- | --- | --- |
| query_id | ID for the query (Foreign Key) | object (String) |
| positive_ids | ID for the positive result (Foreign Key), which the user clicked/purchased | object (String) |
| negative_ids | Space-separated list of IDs for the negative results (Foreign Key), which the user didn't click/purchase | object (String) |
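For ranking experiments it can be convenient to reshape qrels.csv into one row per (query, article) pair with a binary label. The sketch below shows one way to do that with pandas. The helper name `to_long`, the `label` column, and the sample IDs are hypothetical. Both ID columns are split on whitespace, which also works when positive_ids holds a single ID.

```python
import pandas as pd

# Hypothetical sample row in the qrels.csv layout.
qrels = pd.DataFrame({
    "query_id": ["q1"],
    "positive_ids": ["0108775015"],
    "negative_ids": ["0108775044 0110065001"],
})


def to_long(frame, column, label):
    """Expand a space-separated ID column into one row per article."""
    part = frame[["query_id", column]].copy()
    part["article_id"] = part[column].str.split()
    part = part.explode("article_id")
    part["label"] = label
    return part[["query_id", "article_id", "label"]]


# Label 1 marks purchased articles, label 0 marks close negatives.
judgments = pd.concat(
    [to_long(qrels, "positive_ids", 1), to_long(qrels, "negative_ids", 0)],
    ignore_index=True,
)

print(judgments)
```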
Source
The base dataset is provided to the public by H&M Group through the Kaggle platform for analysis and research purposes. We have added search queries over the base dataset.
License
The use of this dataset is subject to the terms and conditions stated on its original distribution page. This dataset is intended for non-commercial and research purposes.
Structured Schema (Zero-Fabrication)
| Feature Key | Data Type |
| --- | --- |
| article_id | int64 |
| product_code | int64 |
| prod_name | string |
| product_type_no | int64 |
| product_type_name | string |
| product_group_name | string |
| graphical_appearance_no | int64 |
| graphical_appearance_name | string |
| colour_group_code | int64 |
| colour_group_name | string |
| perceived_colour_value_id | int64 |
| perceived_colour_value_name | string |
| perceived_colour_master_id | int64 |
| perceived_colour_master_name | string |
| department_no | int64 |
| department_name | string |
| index_code | string |
| index_name | string |
| index_group_no | int64 |
| index_group_name | string |
| section_no | int64 |
| section_name | string |
| garment_group_no | int64 |
| garment_group_name | string |
| detail_desc | string |
Estimated Rows: 105,542