Data-driven Analysis for Receipt Scanning Application

Conducted comprehensive analysis of scanned receipts to gain insights into customer behavior using Python and SQL for analysis, and Tableau for data visualization…
Python
SQL
Tableau
Author

Rebekah Chuang

Published

August 20, 2023

About The Project

The following csv files are the 4 given original sample data files:

  1. brands.csv

  2. receipt_items.csv

  3. receipts.csv

  4. users.csv

The goals of this project are:

  1. Review CSV data and diagram a new structured relational data model

  2. Generate queries that answer predetermined business question(s)

  3. Choose something noteworthy to share with a non-technical stakeholder

Skills: Python(NumPy, Pandas), SQL, Tableau

# library(reticulate)

# install python and create virtual env for the first time

# version = "3.9.12"
# install_python(version = version)
# virtualenv_create("my_python_env", python_version = version)

# install packages

# virtualenv_install(envname = "my_python_env", "pandas", ignore_installed = FALSE, pip_options = character())
# virtualenv_install(envname = "my_python_env", "numpy", ignore_installed = FALSE, pip_options = character())
# virtualenv_install(envname = "my_python_env", "matplotlib", ignore_installed = FALSE, pip_options = character())
# virtualenv_install(envname = "my_python_env", "plotly", ignore_installed = FALSE, pip_options = character())
# virtualenv_install(envname = "my_python_env", "seaborn", ignore_installed = FALSE, pip_options = character())
# virtualenv_install(envname = "my_python_env", "jupyter", ignore_installed = FALSE, pip_options = character())

# run this if env already exists
library(reticulate)
use_virtualenv("my_python_env")
Code
import pandas as pd
import numpy as np

Data Cleaning

First, I cleaned the data using Python(NumPy, Pandas). I dropped useless columns, converted datatype, and saved them into new files.

When saving new files, I add a suffix _new to differentiate them from the original files. However, I changed the file name to the original one when importing csv files into SQL Server to make it more convenient when running queries.

Show the first 5 rows for each table

Code
brands = pd.read_csv('data_file/brands.csv')
brands.head()
ID BARCODE BRAND_CODE CPG_ID CATEGORY CATEGORY_CODE NAME ROMANCE_TEXT RELATED_BRAND_IDS
0 5a8c35dde4b0ccf165fac9e6 511111904175 PEPPERIDGE FARM 5a734034e4b0d58f376be874 Grocery GROCERY Pepperidge Farm Pepperidge Farm has been making exceptional co... NaN
1 6234af8f4e09b6067c237adb 511111212997 CHEX MIX 6233966e8942a67af4934aa1 Snacks SNACKS CHEX MIX NaN NaN
2 5332f7d3e4b03c9a25efd14e 511111803393 NaN 5332f5f2e4b03c9a25efd0aa Snacks SNACKS Cheez-It NaN NaN
3 5d6412e9a3a018514994f426 511111304982 BETTER HOMES & GARDENS 53e10d6368abd3c7065097cc Magazines MAGAZINES Better Homes & Gardens Magazine Better Homes & Gardens offers beautiful photog... 5d6413156d5f3b23d1bc790a : 5d6417dda3a01851499...
4 621e777eacedc065cefa99a7 511111912859 TRUVIA 621e7754d759b10969cbcc08 Baking BAKING Truvia® NaN NaN
Code
receipts = pd.read_csv('data_file/receipts.csv')
receipts.head()
ID STORE_NAME PURCHASE_DATE PURCHASE_TIME DATE_SCANNED TOTAL_SPENT REWARDS_RECEIPT_STATUS USER_ID USER_VIEWED PURCHASED_ITEM_COUNT ... PENDING_DATE MODIFY_DATE FLAGGED_DATE PROCESSED_DATE FINISHED_DATE REJECTED_DATE NEEDS_FETCH_REVIEW DIGITAL_RECEIPT DELETED NON_POINT_EARNING_RECEIPT
0 62868f660a72546bef0b2dd0 TOWN OF ROCKY MOUNT 2022-05-19T00:00:00Z 2:05 PM 2022-05-19T18:41:42.53Z 859.87 FINISHED 61375682c625197c5db73602 True 1.0 ... 2022-05-19T18:57:57.228Z 2022-05-22T11:20:57.998Z 2022-05-19T18:42:57.042Z NaN 2022-05-19T18:57:57.336Z NaN NaN False NaN True
1 6096b7370a7216d316001149 NaN NaN NaN 2021-05-08T16:07:19.03Z NaN SUBMITTED 60047e8a2d7db612a69d2e18 True NaN ... NaN 2021-05-08T16:07:19.03Z NaN NaN NaN NaN NaN False NaN NaN
2 6269a4ea0a7241077408b6e1 FAMILY DOLLAR 2022-04-27T00:00:00Z 4:15 PM 2022-04-27T20:17:46.09Z 11.00 FINISHED 61578e96c2fb9321213720fc True 2.0 ... 2022-04-27T20:17:54.214Z 2022-04-27T20:17:55.521Z NaN NaN 2022-04-27T20:17:54.394Z NaN NaN False NaN False
3 625b25e70a723eb9730d2c9c PUBLIX 2022-04-15T00:00:00Z 4:45 PM 2022-04-16T20:24:07.259Z 10.67 FINISHED 6048de2ec9106b12d2b1ff61 True 4.0 ... 2022-04-16T20:24:23.399Z 2022-04-16T20:24:24.064Z NaN NaN 2022-04-16T20:24:23.598Z NaN NaN False NaN False
4 60e3bd7e0a7215bd550fb8cc COSTCO 2021-06-30T00:00:00Z 1:16 PM 2021-07-06T02:18:38.495Z 61.90 FINISHED 6048de2ec9106b12d2b1ff61 True 10.0 ... 2021-07-06T02:18:49.55Z 2021-07-06T02:18:51.906Z NaN NaN 2021-07-06T02:18:50.283Z NaN NaN False NaN False

5 rows × 21 columns

Code
receipt_items = pd.read_csv('data_file/receipt_items.csv')
receipt_items.head()
REWARDS_RECEIPT_ID ITEM_INDEX REWARDS_RECEIPT_ITEM_ID DESCRIPTION BARCODE BRAND_CODE QUANTITY_PURCHASED TOTAL_FINAL_PRICE POINTS_EARNED REWARDS_GROUP ORIGINAL_RECEIPT_ITEM_TEXT MODIFY_DATE
0 60bb28c10a720d557b128262 0 1efd6d7c75ecbae32214acb6cda41d12 RLGULAR SALE NaN NaN 1.0 9.99 NaN NaN RLGULAR SALE 2021-06-05T07:33:36.91Z
1 60bb28c10a720d557b128262 1 79482a8fa3bd0eef3d626f1c862042e8 82 GOURMET HOUSEW 000240292012 NaN 1.0 4.99 NaN NaN 82 GOURMET HOUSEW 2021-06-05T07:33:36.91Z
2 627151230a724d730825106a 0 b26669cf4ce90cc9d7d3b0ab588cb04b GOLDILOCKS NOPIA R BLAGK NaN NaN 1.0 4.75 NaN NaN GOLDILOCKS NOPIA R BLAGK 2022-05-03T15:58:41.769Z
3 627151230a724d730825106a 1 b4fafd04d8274a1e95b97155edaade2f KURI-IRI DORAYAKI CAKE NaN NaN 1.0 5.99 NaN NaN KURI-IRI DORAYAKI CAKE 2022-05-03T15:58:41.769Z
4 627151230a724d730825106a 2 39694b0880b511e8a12bfb76cf2c20f3 YIZMANG FISH BALL NaN NaN 1.0 12.00 NaN NaN YIZMANG FISH BALL 2022-05-03T15:58:41.769Z
Code
users = pd.read_csv('data_file/users.csv')
users.head()
CREATED_DATE BIRTH_DATE GENDER LAST_REWARDS_LOGIN STATE SIGN_UP_PLATFORM SIGN_UP_SOURCE ID
0 2021-12-20T00:29:17.118Z 1984-03-20T00:00:00Z transgender 2023-01-04T16:32:15Z FL NaN Apple 61bfce5d6655417f803b6538
1 2021-10-21T17:15:25.825Z 1987-08-08T05:00:00Z prefer_not_to_say 2023-01-04T16:04:33Z PA unknown Google 6171a02d8034995a3d858fb3
2 2021-10-23T19:19:18.305Z 1995-06-18T05:00:00Z male 2023-01-04T16:13:13Z FL NaN Apple 617460366e2684498143e769
3 2021-03-30T02:35:41.249Z 1999-08-23T07:00:00Z transgender 2023-01-04T16:09:51Z MI ios Google 60628e7d85cdff011de41113
4 2021-04-26T23:15:54.375Z 1992-10-28T16:16:23Z male 2023-01-04T16:24:18Z CA andrioid Email 608749aac63a95130a45fbf4

Clean table brands

Code
#drop columns that are not used in brands table
brands.drop(columns = ['ID',
                       'CPG_ID',
                       'ROMANCE_TEXT',
                       'RELATED_BRAND_IDS'],
                        axis = 1, inplace = True)
brands.head()
BARCODE BRAND_CODE CATEGORY CATEGORY_CODE NAME
0 511111904175 PEPPERIDGE FARM Grocery GROCERY Pepperidge Farm
1 511111212997 CHEX MIX Snacks SNACKS CHEX MIX
2 511111803393 NaN Snacks SNACKS Cheez-It
3 511111304982 BETTER HOMES & GARDENS Magazines MAGAZINES Better Homes & Gardens Magazine
4 511111912859 TRUVIA Baking BAKING Truvia®
Code
brands.set_index('BARCODE', inplace = True)
brands.head()
BRAND_CODE CATEGORY CATEGORY_CODE NAME
BARCODE
511111904175 PEPPERIDGE FARM Grocery GROCERY Pepperidge Farm
511111212997 CHEX MIX Snacks SNACKS CHEX MIX
511111803393 NaN Snacks SNACKS Cheez-It
511111304982 BETTER HOMES & GARDENS Magazines MAGAZINES Better Homes & Gardens Magazine
511111912859 TRUVIA Baking BAKING Truvia®

Clean table receipts

Code
#drop columns that are not used in receipts table
receipts.drop(columns = ['USER_VIEWED',
                         'PENDING_DATE',
                         'MODIFY_DATE',
                         'FLAGGED_DATE',
                         'PROCESSED_DATE',
                         'NEEDS_FETCH_REVIEW',
                         'DELETED',
                         'NON_POINT_EARNING_RECEIPT'],
                          axis = 1, inplace = True)
receipts.head()
ID STORE_NAME PURCHASE_DATE PURCHASE_TIME DATE_SCANNED TOTAL_SPENT REWARDS_RECEIPT_STATUS USER_ID PURCHASED_ITEM_COUNT CREATE_DATE FINISHED_DATE REJECTED_DATE DIGITAL_RECEIPT
0 62868f660a72546bef0b2dd0 TOWN OF ROCKY MOUNT 2022-05-19T00:00:00Z 2:05 PM 2022-05-19T18:41:42.53Z 859.87 FINISHED 61375682c625197c5db73602 1.0 2022-05-19T18:41:42.53Z 2022-05-19T18:57:57.336Z NaN False
1 6096b7370a7216d316001149 NaN NaN NaN 2021-05-08T16:07:19.03Z NaN SUBMITTED 60047e8a2d7db612a69d2e18 NaN 2021-05-08T16:07:19.03Z NaN NaN False
2 6269a4ea0a7241077408b6e1 FAMILY DOLLAR 2022-04-27T00:00:00Z 4:15 PM 2022-04-27T20:17:46.09Z 11.00 FINISHED 61578e96c2fb9321213720fc 2.0 2022-04-27T20:17:46.09Z 2022-04-27T20:17:54.394Z NaN False
3 625b25e70a723eb9730d2c9c PUBLIX 2022-04-15T00:00:00Z 4:45 PM 2022-04-16T20:24:07.259Z 10.67 FINISHED 6048de2ec9106b12d2b1ff61 4.0 2022-04-16T20:24:07.259Z 2022-04-16T20:24:23.598Z NaN False
4 60e3bd7e0a7215bd550fb8cc COSTCO 2021-06-30T00:00:00Z 1:16 PM 2021-07-06T02:18:38.495Z 61.90 FINISHED 6048de2ec9106b12d2b1ff61 10.0 2021-07-06T02:18:38.495Z 2021-07-06T02:18:50.283Z NaN False
Code
receipts.set_index('ID', inplace = True)
receipts.head()
STORE_NAME PURCHASE_DATE PURCHASE_TIME DATE_SCANNED TOTAL_SPENT REWARDS_RECEIPT_STATUS USER_ID PURCHASED_ITEM_COUNT CREATE_DATE FINISHED_DATE REJECTED_DATE DIGITAL_RECEIPT
ID
62868f660a72546bef0b2dd0 TOWN OF ROCKY MOUNT 2022-05-19T00:00:00Z 2:05 PM 2022-05-19T18:41:42.53Z 859.87 FINISHED 61375682c625197c5db73602 1.0 2022-05-19T18:41:42.53Z 2022-05-19T18:57:57.336Z NaN False
6096b7370a7216d316001149 NaN NaN NaN 2021-05-08T16:07:19.03Z NaN SUBMITTED 60047e8a2d7db612a69d2e18 NaN 2021-05-08T16:07:19.03Z NaN NaN False
6269a4ea0a7241077408b6e1 FAMILY DOLLAR 2022-04-27T00:00:00Z 4:15 PM 2022-04-27T20:17:46.09Z 11.00 FINISHED 61578e96c2fb9321213720fc 2.0 2022-04-27T20:17:46.09Z 2022-04-27T20:17:54.394Z NaN False
625b25e70a723eb9730d2c9c PUBLIX 2022-04-15T00:00:00Z 4:45 PM 2022-04-16T20:24:07.259Z 10.67 FINISHED 6048de2ec9106b12d2b1ff61 4.0 2022-04-16T20:24:07.259Z 2022-04-16T20:24:23.598Z NaN False
60e3bd7e0a7215bd550fb8cc COSTCO 2021-06-30T00:00:00Z 1:16 PM 2021-07-06T02:18:38.495Z 61.90 FINISHED 6048de2ec9106b12d2b1ff61 10.0 2021-07-06T02:18:38.495Z 2021-07-06T02:18:50.283Z NaN False
Code
#change the data type for PURCHASED_ITEM_COUNT into int
receipts['PURCHASED_ITEM_COUNT'] = receipts['PURCHASED_ITEM_COUNT'].astype(float).astype('Int64')
receipts.head()
STORE_NAME PURCHASE_DATE PURCHASE_TIME DATE_SCANNED TOTAL_SPENT REWARDS_RECEIPT_STATUS USER_ID PURCHASED_ITEM_COUNT CREATE_DATE FINISHED_DATE REJECTED_DATE DIGITAL_RECEIPT
ID
62868f660a72546bef0b2dd0 TOWN OF ROCKY MOUNT 2022-05-19T00:00:00Z 2:05 PM 2022-05-19T18:41:42.53Z 859.87 FINISHED 61375682c625197c5db73602 1 2022-05-19T18:41:42.53Z 2022-05-19T18:57:57.336Z NaN False
6096b7370a7216d316001149 NaN NaN NaN 2021-05-08T16:07:19.03Z NaN SUBMITTED 60047e8a2d7db612a69d2e18 <NA> 2021-05-08T16:07:19.03Z NaN NaN False
6269a4ea0a7241077408b6e1 FAMILY DOLLAR 2022-04-27T00:00:00Z 4:15 PM 2022-04-27T20:17:46.09Z 11.00 FINISHED 61578e96c2fb9321213720fc 2 2022-04-27T20:17:46.09Z 2022-04-27T20:17:54.394Z NaN False
625b25e70a723eb9730d2c9c PUBLIX 2022-04-15T00:00:00Z 4:45 PM 2022-04-16T20:24:07.259Z 10.67 FINISHED 6048de2ec9106b12d2b1ff61 4 2022-04-16T20:24:07.259Z 2022-04-16T20:24:23.598Z NaN False
60e3bd7e0a7215bd550fb8cc COSTCO 2021-06-30T00:00:00Z 1:16 PM 2021-07-06T02:18:38.495Z 61.90 FINISHED 6048de2ec9106b12d2b1ff61 10 2021-07-06T02:18:38.495Z 2021-07-06T02:18:50.283Z NaN False

Clean table receipt_items

Code
receipt_items.drop(columns = ['REWARDS_RECEIPT_ITEM_ID',
                              'POINTS_EARNED',
                              'REWARDS_GROUP',
                              'ORIGINAL_RECEIPT_ITEM_TEXT',
                              'MODIFY_DATE'],
                               axis = 1, inplace = True)
receipt_items.head()
REWARDS_RECEIPT_ID ITEM_INDEX DESCRIPTION BARCODE BRAND_CODE QUANTITY_PURCHASED TOTAL_FINAL_PRICE
0 60bb28c10a720d557b128262 0 RLGULAR SALE NaN NaN 1.0 9.99
1 60bb28c10a720d557b128262 1 82 GOURMET HOUSEW 000240292012 NaN 1.0 4.99
2 627151230a724d730825106a 0 GOLDILOCKS NOPIA R BLAGK NaN NaN 1.0 4.75
3 627151230a724d730825106a 1 KURI-IRI DORAYAKI CAKE NaN NaN 1.0 5.99
4 627151230a724d730825106a 2 YIZMANG FISH BALL NaN NaN 1.0 12.00
Code
#change the data type for QUANTITY_PURCHASED into int
receipt_items['QUANTITY_PURCHASED'] = receipt_items['QUANTITY_PURCHASED'].astype(float).astype('Int64')
receipt_items.head()
REWARDS_RECEIPT_ID ITEM_INDEX DESCRIPTION BARCODE BRAND_CODE QUANTITY_PURCHASED TOTAL_FINAL_PRICE
0 60bb28c10a720d557b128262 0 RLGULAR SALE NaN NaN 1 9.99
1 60bb28c10a720d557b128262 1 82 GOURMET HOUSEW 000240292012 NaN 1 4.99
2 627151230a724d730825106a 0 GOLDILOCKS NOPIA R BLAGK NaN NaN 1 4.75
3 627151230a724d730825106a 1 KURI-IRI DORAYAKI CAKE NaN NaN 1 5.99
4 627151230a724d730825106a 2 YIZMANG FISH BALL NaN NaN 1 12.00
Code
receipt_items.set_index('REWARDS_RECEIPT_ID', inplace = True)
receipt_items.head()
ITEM_INDEX DESCRIPTION BARCODE BRAND_CODE QUANTITY_PURCHASED TOTAL_FINAL_PRICE
REWARDS_RECEIPT_ID
60bb28c10a720d557b128262 0 RLGULAR SALE NaN NaN 1 9.99
60bb28c10a720d557b128262 1 82 GOURMET HOUSEW 000240292012 NaN 1 4.99
627151230a724d730825106a 0 GOLDILOCKS NOPIA R BLAGK NaN NaN 1 4.75
627151230a724d730825106a 1 KURI-IRI DORAYAKI CAKE NaN NaN 1 5.99
627151230a724d730825106a 2 YIZMANG FISH BALL NaN NaN 1 12.00

Clean table users

Code
users.drop(columns = ['LAST_REWARDS_LOGIN'], axis = 1, inplace = True)
users.head()
CREATED_DATE BIRTH_DATE GENDER STATE SIGN_UP_PLATFORM SIGN_UP_SOURCE ID
0 2021-12-20T00:29:17.118Z 1984-03-20T00:00:00Z transgender FL NaN Apple 61bfce5d6655417f803b6538
1 2021-10-21T17:15:25.825Z 1987-08-08T05:00:00Z prefer_not_to_say PA unknown Google 6171a02d8034995a3d858fb3
2 2021-10-23T19:19:18.305Z 1995-06-18T05:00:00Z male FL NaN Apple 617460366e2684498143e769
3 2021-03-30T02:35:41.249Z 1999-08-23T07:00:00Z transgender MI ios Google 60628e7d85cdff011de41113
4 2021-04-26T23:15:54.375Z 1992-10-28T16:16:23Z male CA andrioid Email 608749aac63a95130a45fbf4
Code
users.set_index('ID', inplace = True)
users.head()
CREATED_DATE BIRTH_DATE GENDER STATE SIGN_UP_PLATFORM SIGN_UP_SOURCE
ID
61bfce5d6655417f803b6538 2021-12-20T00:29:17.118Z 1984-03-20T00:00:00Z transgender FL NaN Apple
6171a02d8034995a3d858fb3 2021-10-21T17:15:25.825Z 1987-08-08T05:00:00Z prefer_not_to_say PA unknown Google
617460366e2684498143e769 2021-10-23T19:19:18.305Z 1995-06-18T05:00:00Z male FL NaN Apple
60628e7d85cdff011de41113 2021-03-30T02:35:41.249Z 1999-08-23T07:00:00Z transgender MI ios Google
608749aac63a95130a45fbf4 2021-04-26T23:15:54.375Z 1992-10-28T16:16:23Z male CA andrioid Email

Save new csv files

Code
brands.to_csv('data_file/brands_new.csv')
receipts.to_csv('data_file/receipts_new.csv')
receipt_items.to_csv('data_file/receipt_items_new.csv')
users.to_csv('data_file/users_new.csv')

Design a New Relational Data Model

The next step is diagramming a new structured relational data model. After cleaning the data, I created a new ER diagram(Entity-Relationship Diagram) using Lucidchart for this database(shown below).

  • PK stands for Primary Key
  • FK stands for Foreign Key

SQL Queries

Question1: Which brand saw the most dollars spent in the month of June?

WITH cte AS
(
   SELECT
       b.BARCODE,
       b.NAME AS brand_name,
       ri.TOTAL_FINAL_PRICE,
       r.PURCHASE_DATE,
       MONTH(r.PURCHASE_DATE) AS month
   FROM brands AS b
   JOIN receipt_items AS ri
   ON b.BARCODE = ri.BARCODE
   JOIN receipts AS r
   ON r.ID = ri.REWARDS_RECEIPT_ID
)

SELECT
   TOP 1 brand_name,
   SUM(TOTAL_FINAL_PRICE) AS price,
   month
FROM cte
GROUP BY brand_name, month
HAVING MONTH = 6
ORDER BY price DESC;

Question2: Which user spent the most money in the month of August?

WITH cte AS
(
    SELECT
        u.ID AS user_id,
        MONTH(r.PURCHASE_DATE) AS month,
        r.TOTAL_SPENT
    FROM users AS u
    JOIN receipts AS r
    ON u.ID = r.USER_ID
)

SELECT
    TOP 1 user_id,
    month,
    SUM(TOTAL_SPENT) AS total_spent
FROM cte
GROUP BY user_id, month
HAVING month = 8
ORDER BY total_spent DESC;

Question3: What user bought the most expensive item? Question4: What is the name of the most expensive item purchased?

WITH cte AS
(
    SELECT
        u.ID AS user_id,
        r.ID AS receipt_id,
        ri.DESCRIPTION AS item_name,
        CAST(ROUND(ri.TOTAL_FINAL_PRICE, 2) AS DECIMAL(10,2)) AS total_final_price,
        ri.QUANTITY_PURCHASED AS quantity_purchased,
        CAST(ROUND(total_final_price/quantity_purchased, 2) AS FLOAT) AS item_price
    FROM users AS u
    JOIN receipts AS r
    ON u.ID = r.USER_ID
    JOIN receipt_items AS ri
    ON r.ID = ri.REWARDS_RECEIPT_ID
    WHERE (total_final_price IS NOT NULL)
    AND (total_final_price != 0)
    AND (quantity_purchased IS NOT NULL)
    AND (quantity_purchased != 0)
)
SELECT
    TOP 1 user_id,
    item_name,
    item_price
FROM cte
ORDER BY item_price DESC;

Question5: How many users scanned in each month?

WITH cte AS
(
    SELECT
        u.ID AS user_id,
        MONTH(r.DATE_SCANNED) AS month_scanned
    FROM users AS u
    JOIN receipts AS r
    ON u.ID = r.USER_ID
)

SELECT
    CASE WHEN month = 1 THEN 'JAN'
    WHEN month = 2 THEN 'FEB'
    WHEN month = 3 THEN 'MAR'
    WHEN month = 4 THEN 'APR'
    WHEN month = 5 THEN 'MAY'
    WHEN month = 6 THEN 'JUN'
    WHEN month = 7 THEN 'JUL'
    WHEN month = 8 THEN 'AUG'
    WHEN month = 9 THEN 'SEP'
    WHEN month = 10 THEN 'OCT'
    WHEN month = 11 THEN 'NOV'
    ELSE 'DEC' END AS month,
    user_id_count
FROM
(
    SELECT
        TOP 12 month_scanned AS month,
        COUNT(user_id) AS user_id_count
    FROM cte
    GROUP BY month_scanned
    ORDER BY month
)sub;

Question6: What is the top 10 category that has the most brands?

SELECT
    TOP 10 COUNT(NAME) AS brand_count,
    CATEGORY AS category
FROM brands
GROUP BY category
HAVING category IS NOT NULL
ORDER BY brand_count DESC;

Question7: How much does each store earn in 2022 (list top 20)?

SELECT
    TOP 20 STORE_NAME AS store_name,
    ROUND(SUM(TOTAL_SPENT), 3) AS revenue
FROM receipts
WHERE YEAR(PURCHASE_DATE) = 2022
GROUP BY store_name
ORDER BY revenue DESC;

Question8: What are the top 5 states that registered users come from?

SELECT
    TOP 5 STATE AS state,
    COUNT(*) AS count
FROM users
GROUP BY state
ORDER BY count DESC;

Question9: What is the age distribution of the registered users?

WITH cte AS
(
    SELECT
        *,
        CASE WHEN age BETWEEN 0 AND 10 THEN '0-10'
        WHEN age BETWEEN 11 AND 20 THEN '11-20'
        WHEN age BETWEEN 21 AND 30 THEN '21-30'
        WHEN age BETWEEN 31 AND 40 THEN '31-40'
        WHEN age BETWEEN 41 AND 50 THEN '41-50'
        WHEN age BETWEEN 51 AND 60 THEN '51-60'
        WHEN age BETWEEN 61 AND 70 THEN '61-70'
        WHEN age BETWEEN 71 AND 80 THEN '71-80'
        END AS age_range
    FROM
    (
        SELECT
            *,
            2023 - YEAR(BIRTH_DATE) AS age
        FROM users
    ) sub
)

SELECT
    age_range,
    COUNT(*) AS age_range_count
FROM cte
GROUP BY age_range;

Question10: What is the distribution of the sign up source?

SELECT
    SIGN_UP_SOURCE AS sign_up_source,
    COUNT(*) AS source_count
FROM users
GROUP BY sign_up_source;

Insights

For the last part, I used Tableau to create visualizations and dashboards to help non-technical stakeholders easily understand my analysis.

1. Distribution of Registered Users (Map):

The visualization below shows the distribution of registered users. From this visualization, we know that Florida State has the most registered users since the color of Florida is the darkest on the whole map.

2. Rewards Receipts Status (Highlight table):

From the highlight table below, we can know the number and percentage of the receipt that is FINISHED/FLAGGED/PENDING/REJECTED/SUBMITTED. For example, among all the rewards receipts, 91.15% of them are finished(which is 64,350). The darker color represents the higher percentage/number.

3. Top 20 Stores’ Revenue By Year(2020-2023) (Treemap):

I created a treemap in the following dashboard to visualize the store’s revenue(top 20) each year. Users can use the slider above the dashboard to filter the year they want. The bigger the area, the larger the revenue.

Users can see the corresponding store name and revenue that year when they hover over the dashboard.

2020

2021

2022

2023

4. Brand Count(By Category) (Treemap):

This visualization can be used to answer Question 6. From this treemap, we know that Beverages is the category with the most brand among all the categories. Users can see the category and brand count when they hover over the visualization.

5. Scanned User(By Year/Quarter) (Bar chart):

The dashboard below can be used to answer Question 5 and helps stakeholders visualize the number of users scanned in a specific year or quarter. Users can use the slider to filter quarters and the dropdown to filter years.