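We look through the provided data to understand the current situation and to identify factors that influence orders.
Kaggle data description
From here on, we actually organize the data (readers who are not interested in the code may skip the code sections).
References: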
https://www.kaggle.com/philippsp/exploratory-analysis-instacart
https://www.kaggle.com/newforme/exploratory-data-analysis-instacart
Setup
# Load libraries
library(data.table)
library(dplyr)
library(ggplot2)
library(stringr)
library(treemap)
# Load data
orders <- fread('./data/case02_orders.csv', showProgress = FALSE, data.table = FALSE)
products <- fread('./data/case02_products.csv', showProgress = FALSE, data.table = FALSE)
order_products <- fread('./data/case02_order_products__train.csv', showProgress = FALSE, data.table = FALSE)
order_products_prior <- fread('./data/case02_order_products__prior.csv', showProgress = FALSE, data.table = FALSE)
aisles <- fread('./data/case02_aisles.csv', showProgress = FALSE, data.table = FALSE)
departments <- fread('./data/case02_departments.csv', showProgress = FALSE, data.table = FALSE)
# Preprocessing: convert the hour to numeric and the label columns to factors
orders <- orders %>%
  mutate(order_hour_of_day = as.numeric(order_hour_of_day),
         eval_set = as.factor(eval_set))
products <- products %>%
  mutate(product_name = as.factor(product_name))
aisles <- aisles %>%
  mutate(aisle = as.factor(aisle))
departments <- departments %>%
  mutate(department = as.factor(department))
orders %>%
  ggplot(aes(x = order_hour_of_day)) +
  geom_histogram(bins = 24) +
  labs(x = "hour")
Most orders appear to be placed between 8:00 and 18:00.
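To put a number on what the histogram suggests, the share of orders placed inside that time window can be computed directly. The following is a minimal sketch in base R; the helper `share_in_hours()` and the `toy_orders` data frame are hypothetical and only stand in for the `orders` table loaded above.

```r
# Hypothetical helper: fraction of orders whose hour lies in [from, to].
share_in_hours <- function(hours, from = 8, to = 18) {
  hours <- hours[!is.na(hours)]        # ignore missing hours
  mean(hours >= from & hours <= to)    # proportion inside the window
}

# Toy data for illustration only (not the Instacart data).
toy_orders <- data.frame(order_hour_of_day = c(2, 8, 9, 13, 18, 19, 23, 10))

share_in_hours(toy_orders$order_hour_of_day)
```

With the real data, the same call would be `share_in_hours(orders$order_hour_of_day)`.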
orders %>%
  ggplot(aes(x = order_dow)) +
  geom_bar() +  # order_dow is a discrete day code, so count each value directly
  labs(x = "day of week")
The data does not say which day of the week each code corresponds to, but some days appear to have more orders than others.
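The difference between days can also be read off a small table of counts and shares per day code. This is a sketch under stated assumptions: `dow_summary()` and `toy_dow` are hypothetical names, and the toy vector merely imitates the `order_dow` column.

```r
# Hypothetical helper: order counts and shares per day-of-week code.
dow_summary <- function(dow) {
  counts <- table(dow)                 # NA values are excluded by default
  data.frame(
    order_dow  = as.integer(names(counts)),
    n_orders   = as.integer(counts),
    proportion = as.numeric(counts) / sum(counts)
  )
}

# Toy data for illustration only (not the Instacart data).
toy_dow <- c(0, 0, 0, 1, 1, 2, 3, 4, 5, 6)

res <- dow_summary(toy_dow)
res[order(-res$n_orders), ]            # busiest day codes first
```

With the real data, `dow_summary(orders$order_dow)` would list the day codes from busiest to quietest after the same sort.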
orders %>%
  filter(!is.na(days_since_prior_order)) %>%  # drop rows with no prior-order gap
  ggplot(aes(x = days_since_prior_order)) +
  geom_histogram(binwidth = 1)  # one bin per day
Many customers appear to order at one-week intervals.
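One rough way to check the weekly pattern is to measure how many reorder gaps fall on an exact multiple of seven days. The sketch below uses base R; `share_on_period()` and `toy_days` are hypothetical, and the toy values are invented for illustration.

```r
# Hypothetical helper: share of reorder gaps that are an exact multiple of
# `period` days. Missing gaps and same-day reorders (0 days) are excluded.
share_on_period <- function(days, period = 7) {
  days <- days[!is.na(days) & days > 0]
  mean(days %% period == 0)
}

# Toy data for illustration only (not the Instacart data).
toy_days <- c(NA, 7, 7, 14, 3, 30, 21, 5, 0, 7)

share_on_period(toy_days)
```

If gaps were spread evenly across days, roughly one in seven would land on a multiple of seven, so a share well above that for `orders$days_since_prior_order` would support the weekly reading.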
order_products %>%
  group_by(order_id) %>%
  # last() gives the item count only if rows are sorted by add_to_cart_order within each order
  summarize(n_items = last(add_to_cart_order)) %>%
  ggplot(aes(x = n_items)) +
  geom_histogram(binwidth = 1) +
  geom_rug() +
  coord_cartesian(xlim = c(0, 80))
order_products %>%
  group_by(reordered) %>%
  summarize(count = n()) %>%
  mutate(reordered = as.factor(reordered)) %>%
  mutate(proportion = count / sum(count)) %>%
  ggplot(aes(x = reordered, y = count, fill = reordered)) +
  geom_bar(stat = "identity")
# Number of products per department/aisle, with the readable names attached
tmp <- products %>%
  group_by(department_id, aisle_id) %>%
  summarize(n = n()) %>%
  left_join(departments, by = "department_id") %>%
  left_join(aisles, by = "aisle_id")
# Count ordered items per product, roll them up to department/aisle, and plot
order_products %>%
  group_by(product_id) %>%
  summarize(count = n()) %>%
  left_join(products, by = "product_id") %>%
  ungroup() %>%
  group_by(department_id, aisle_id) %>%
  summarize(sumcount = sum(count)) %>%
  left_join(tmp, by = c("department_id", "aisle_id")) %>%
  mutate(onesize = 1) %>%
  treemap(index = c("department", "aisle"), vSize = "sumcount",
          title = "", palette = "Set3", border.col = "#FFFFFF")
The size of each tile represents the number of times items were ordered. Fresh food items are ordered most often.
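The same ranking can be checked numerically by counting ordered items per department. The following is a minimal sketch in base R; all `toy_*` tables are hypothetical, their IDs and names are invented, and only the column names follow the tables used above.

```r
# Toy tables for illustration only (not the Instacart data).
toy_order_products <- data.frame(order_id   = c(1, 1, 1, 2, 2, 3),
                                 product_id = c(10, 11, 12, 10, 12, 10))
toy_products <- data.frame(product_id    = c(10, 11, 12),
                           department_id = c(4, 4, 16))
toy_departments <- data.frame(department_id = c(4, 16),
                              department    = c("produce", "dairy eggs"))

# Attach the department to every ordered item, then count items per department.
joined <- merge(merge(toy_order_products, toy_products, by = "product_id"),
                toy_departments, by = "department_id")
dept_counts <- as.data.frame(table(department = joined$department),
                             responseName = "n_ordered",
                             stringsAsFactors = FALSE)

# Sort from most to least ordered and add each department's share.
dept_counts <- dept_counts[order(-dept_counts$n_ordered), ]
dept_counts$share <- dept_counts$n_ordered / sum(dept_counts$n_ordered)
dept_counts
```

Replacing the toy tables with `order_products`, `products`, and `departments` should give the department-level totals that the treemap tiles are sized by.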
"Creating value from data": 一般社団法人データマーケティングラボラトリー
Copyright© DML All Rights Reserved.