XGBoost to optimize the budget division and maximize the sales for each vendor product

We have a list of vendor-products with vendor discounts, and historical data showing how much of each product was sold at each total discount, per vendor and vendor area. We have a limited budget to add on top of the vendor discounts to sell the products better. Please write Python code to optimize the budget division and maximize the sales of each vendor product based on the historical data.
The historical sales data is in a database table that has 5,377,303 records and grows daily. The input list is a CSV file containing vendor_id, product_id and vendor_discount (a percentage written as a fraction, e.g. 0.2). We will complete the list by adding vendor details (city, business line and area), product details (category and sub-category), and the quantity sold per area in the last 30 days.

  • Vendor ID: Identifies which vendor sold the product.
  • Vendor Area ID: Identifies the area where the vendor is located.
  • Product ID: Identifies the specific product sold.
  • Product Category ID: Identifies the category of the product.
  • Product Sub-Category ID: Identifies the sub-category of the product.
  • Product Vendor Discount: The discount applied by the vendor to the product.
  • Product On-Top Discount: The additional discount applied on top of the vendor discount.
  • Product Quantity: The quantity of the product sold.
  • Product Total Price: The total price of the product (quantity * unit price).
  • Product Total Discount Amount: The total discount amount applied to the product.
  • Product Sub-Total Amount: The subtotal amount of the product (total price – total discount amount).

The input list is a CSV file like:

vendor_id,product_id,vendor_discount
20873742,10023452,0.2
33772771,87817232,0.3

The historical data is an order detail table that holds each ordered product and its details, like:

order_detail_id,order_id,created_at,vendor_id,vendor_city_id,vendor_business_line_id,vendor_area_id,product_id,product_category_id,product_sub_category_id,product_vendor_discount,product_on_top_discount,product_total_discount,product_quantity,product_unit_price,product_total_price,product_total_discount_amount,product_sub_total_amount
123142,123232,'2022-01-01 00:23:23',234124,123,1,123,123123,1233,4321,0.2,0.1,0.3,5,100,500,150,350

We want Python code that uses machine learning to train a model on the historical data with a fast, optimized and parallel solution. It should then take the CSV input list and the budget limit for tomorrow’s campaign from the user, and optimize product_on_top_discount within that budget to maximize the sales of each vendor – product for tomorrow’s campaign.

To optimize the product_on_top_discount based on budget to maximize sales for each vendor – product, we can use a machine learning model that learns from the historical data how each vendor’s product sales respond to discounts, and then use its predictions to choose a product_on_top_discount for each vendor – product combination within the budget limit.

Here’s an outline of the steps to achieve this:

  1. Connect to the database and extract the historical order details data.
  2. Clean and preprocess the data, converting it into a suitable format for machine learning.
  3. Split the data into training and testing sets.
  4. Train a regression model using the training data to predict the total product sales for each vendor – product combination.
  5. Use the trained model to predict the optimal product_on_top_discount for each vendor – product combination given a budget limit.
  6. Output the optimized list of vendor – product combinations and their corresponding product_on_top_discounts.
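
As a minimal sketch of the cleaning in step 2, you could check that the derived amounts in each order detail row agree with the raw fields before training on it. The function name `is_consistent` and the tolerance are hypothetical, and the rule that the discount amount equals total price times total discount is inferred from the sample row in the question rather than stated anywhere.

```python
def is_consistent(row, tol=1e-6):
    """Return True when the derived amounts agree with the raw fields."""
    total_price = row["product_quantity"] * row["product_unit_price"]
    # Assumption: discount amount = total price * total discount fraction
    discount_amount = total_price * row["product_total_discount"]
    sub_total = total_price - discount_amount
    return (
        abs(total_price - row["product_total_price"]) <= tol
        and abs(discount_amount - row["product_total_discount_amount"]) <= tol
        and abs(sub_total - row["product_sub_total_amount"]) <= tol
    )


# The sample row from the order detail table (only the fields used here)
sample = {
    "product_total_discount": 0.3,
    "product_quantity": 5,
    "product_unit_price": 100,
    "product_total_price": 500,
    "product_total_discount_amount": 150,
    "product_sub_total_amount": 350,
}

# A corrupted copy with a wrong sub-total, which the check should reject
corrupted = dict(sample, product_sub_total_amount=400)

print(is_consistent(sample), is_consistent(corrupted))
```

Rows that fail the check could be dropped or logged for review before the train/test split.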

Here is some guidance on how you could approach building a model to predict optimal discounts for each vendor-product combination based on their sub-category and area.

To start, you would need to extract the relevant data from your MySQL database table and prepare it for analysis. This might involve cleaning the data, transforming it into a format that’s more suitable for machine learning, and splitting the data into training and testing sets.
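
To make the extraction step concrete, here is a rough sketch that sums the quantity sold per vendor area and product over the last 30 days and reads the result in chunks. It uses an in-memory SQLite database from the standard library purely as a stand-in for the MySQL table. The function name `recent_quantity_by_area`, the chunk size and the demo rows are assumptions, and the date arithmetic would need MySQL syntax in a real deployment.

```python
import sqlite3


def recent_quantity_by_area(conn, as_of, days=30, chunk_size=50_000):
    """Sum product_quantity per (vendor_area_id, product_id) for the `days` before `as_of`."""
    query = """
        SELECT vendor_area_id, product_id, SUM(product_quantity)
        FROM order_detail
        WHERE created_at >= datetime(?, ?) AND created_at <= ?
        GROUP BY vendor_area_id, product_id
    """
    cursor = conn.execute(query, (as_of, f"-{days} days", as_of))
    totals = {}
    while True:
        rows = cursor.fetchmany(chunk_size)  # read in chunks, not all at once
        if not rows:
            break
        for area_id, product_id, quantity in rows:
            totals[(area_id, product_id)] = quantity
    return totals


# Tiny demo table with only the columns this query needs
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_detail (created_at TEXT, vendor_area_id INTEGER, "
    "product_id INTEGER, product_quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO order_detail VALUES (?, ?, ?, ?)",
    [
        ("2022-01-01 00:23:23", 123, 123123, 5),
        ("2022-01-15 10:00:00", 123, 123123, 2),
        ("2021-11-01 09:00:00", 123, 123123, 9),  # outside the 30-day window
    ],
)
totals = recent_quantity_by_area(conn, as_of="2022-01-20 00:00:00")
print(totals)
```

Doing the aggregation in SQL keeps most of the 5 million rows in the database and only moves the grouped result into Python.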

Once you’ve prepared the data, you could use a regression algorithm such as linear regression or decision tree regression to build a predictive model. You would want to use the historical sales data to train the model and then test its accuracy on the testing set.
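
The train-then-test idea can be sketched without any libraries. The example below fits a one-feature least-squares line (quantity against total discount) on 80% of some synthetic data and compares its error on the remaining 20% with a baseline that always predicts the training mean. The data, the helper names `fit_line` and `mean_absolute_error`, and the single-feature setup are all assumptions made for illustration; a real model would use more features.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic data (hypothetical): quantity grows with total discount, plus noise
rng = random.Random(42)
discounts = [rng.uniform(0.0, 0.5) for _ in range(200)]
quantities = [3 + 20 * d + rng.gauss(0, 1) for d in discounts]

# 80/20 split on shuffled indices
indices = list(range(len(discounts)))
rng.shuffle(indices)
train_idx, test_idx = indices[:160], indices[160:]
train_x = [discounts[i] for i in train_idx]
train_y = [quantities[i] for i in train_idx]
test_x = [discounts[i] for i in test_idx]
test_y = [quantities[i] for i in test_idx]

slope, intercept = fit_line(train_x, train_y)
model_mae = mean_absolute_error(test_y, [intercept + slope * x for x in test_x])

# Baseline: always predict the mean quantity seen in training
baseline = sum(train_y) / len(train_y)
baseline_mae = mean_absolute_error(test_y, [baseline] * len(test_y))

print(f"model MAE: {model_mae:.2f}, baseline MAE: {baseline_mae:.2f}")
```

If the model does not beat the baseline on held-out data, its predictions are not a sound basis for dividing the budget.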

In terms of feature engineering, you might consider creating new variables that capture interactions between the existing variables. For example, you could create a variable that represents the interaction between the vendor ID and product sub-category ID to capture any synergies that exist between certain vendors and sub-categories.
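
One possible way to build such an interaction variable is to join the two IDs into a single key and then replace each key with the average quantity seen for it (often called mean or target encoding). The sketch below assumes rows are plain dictionaries; the names `add_interaction_key`, `mean_target_by_key` and `vendor_subcat` are hypothetical. The averages should be computed on the training set only, so that test rows do not leak into the feature.

```python
from collections import defaultdict


def add_interaction_key(rows):
    """Add a combined vendor / sub-category key to every row (in place)."""
    for row in rows:
        row["vendor_subcat"] = f'{row["vendor_id"]}_{row["product_sub_category_id"]}'
    return rows


def mean_target_by_key(rows, key, target):
    """Average `target` per distinct value of `key`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[target]
        counts[row[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Hypothetical training rows
train_rows = add_interaction_key([
    {"vendor_id": 234124, "product_sub_category_id": 4321, "product_quantity": 5},
    {"vendor_id": 234124, "product_sub_category_id": 4321, "product_quantity": 3},
    {"vendor_id": 234124, "product_sub_category_id": 9999, "product_quantity": 1},
])

encoding = mean_target_by_key(train_rows, "vendor_subcat", "product_quantity")

# Attach the numeric feature; unseen keys would need a fallback such as the global mean
for row in train_rows:
    row["vendor_subcat_mean_qty"] = encoding[row["vendor_subcat"]]

print(encoding)
```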

You may also want to consider using regularization techniques such as L1 or L2 regularization to prevent overfitting and improve the generalizability of the model.
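
To show what the two penalties do, here is a small sketch for the one-feature case, where both have a closed form. It minimizes sum((y - a - b*x)^2) + l1*|b| + l2*b^2 with the intercept a left unpenalized. The function name `fit_slope` and the toy numbers are assumptions, and real models with many features would rely on a library solver instead.

```python
def fit_slope(xs, ys, l1=0.0, l2=0.0):
    """Slope b minimizing sum((y - a - b*x)^2) + l1*|b| + l2*b^2."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # L1 soft-thresholds the numerator, so a weak effect becomes exactly zero
    shrunk = max(abs(sxy) - l1 / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    # L2 inflates the denominator, which pulls the slope toward zero
    return sign * shrunk / (sxx + l2)


# Toy data (hypothetical): total discount against quantity sold
discounts = [0.0, 0.1, 0.2, 0.3, 0.4]
quantities = [3, 5, 6, 9, 10]

ols_slope = fit_slope(discounts, quantities)            # no penalty
ridge_slope = fit_slope(discounts, quantities, l2=0.1)  # L2 only
lasso_slope = fit_slope(discounts, quantities, l1=4.0)  # L1 only

print(ols_slope, ridge_slope, lasso_slope)
```

L2 shrinks the slope smoothly, while a large enough L1 penalty removes the feature altogether, which is why L1 is sometimes used for feature selection.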

Ultimately, the goal would be to build a model that accurately predicts the optimal discount for each vendor-product combination based on their sub-category and area.

Here’s a Python code implementation:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load historical order detail data (exported from the database table)
order_detail_df = pd.read_csv('order_detail.csv')

# Load input list (vendor_id, product_id, vendor_discount)
input_list_df = pd.read_csv('input_list.csv')

# Keep only the historical rows for vendor-product pairs in the input list
data_df = pd.merge(order_detail_df, input_list_df, how='inner', on=['vendor_id', 'product_id'])

# Select relevant columns
data_df = data_df[['vendor_id', 'product_id', 'product_quantity', 'product_total_discount', 'product_on_top_discount']]

# Split data into training and testing sets
# Note: the IDs are categorical; they are used as plain numbers here only for simplicity
feature_cols = ['vendor_id', 'product_id', 'product_total_discount']
X = data_df[feature_cols]
y = data_df['product_quantity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train linear regression model and check it on the held-out set
model = LinearRegression()
model.fit(X_train, y_train)
print(f"R^2 on test set: {model.score(X_test, y_test):.3f}")

# Get the budget limit from user
budget_limit = float(input("Enter budget limit: "))

# Calculate predicted sales for each vendor-product combination.
# The model was trained on product_total_discount, so vendor_discount is
# renamed to match the feature names seen during fit.
X_new = input_list_df[['vendor_id', 'product_id', 'vendor_discount']].rename(
    columns={'vendor_discount': 'product_total_discount'})
input_list_df['predicted_sales'] = model.predict(X_new)

# Normalize predicted sales by dividing by total predicted sales across all products
total_predicted_sales = input_list_df['predicted_sales'].sum()
input_list_df['normalized_sales'] = input_list_df['predicted_sales'] / total_predicted_sales

# Calculate additional discount based on budget limit and normalized sales
# (a rough heuristic: it mixes a currency budget with discount fractions)
input_list_df['additional_discount'] = (budget_limit * input_list_df['normalized_sales']) - (budget_limit * input_list_df['vendor_discount'])

# Calculate product_on_top_discount: capped at the vendor discount, never negative
input_list_df['product_on_top_discount'] = input_list_df[['vendor_discount', 'additional_discount']].min(axis=1).clip(lower=0)

# Save optimized budget and discounts in a new CSV file
input_list_df[['vendor_id', 'product_id', 'product_on_top_discount']].to_csv('optimized_budget_and_discounts.csv', index=False)
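
The allocation step above combines a currency budget with discount fractions, so its output is hard to interpret. As one possible alternative, and only a sketch, the budget can be split in proportion to predicted quantity and each share converted into a discount fraction by dividing by the product's expected sales value. This needs a unit price per product, which the input list does not contain. The `unit_price` field, the `max_on_top` cap and the function name `allocate_on_top_discounts` are therefore assumptions. The sketch also ignores the extra units a discount should generate, so actual spend could be higher than the expected spend it reports.

```python
def allocate_on_top_discounts(products, budget, max_on_top=0.3):
    """Split `budget` by predicted quantity and convert each share to a discount fraction."""
    total_qty = sum(p["predicted_qty"] for p in products)
    plan = []
    for p in products:
        expected_value = p["predicted_qty"] * p["unit_price"]  # currency
        if total_qty <= 0 or expected_value <= 0:
            discount = 0.0
        else:
            share = budget * p["predicted_qty"] / total_qty     # currency
            discount = min(max_on_top, share / expected_value)  # fraction
        plan.append({
            "vendor_id": p["vendor_id"],
            "product_id": p["product_id"],
            "product_on_top_discount": discount,
            "expected_spend": discount * expected_value,
        })
    return plan


# Hypothetical predictions and prices for the two rows of the sample input list
products = [
    {"vendor_id": 20873742, "product_id": 10023452, "predicted_qty": 40, "unit_price": 100},
    {"vendor_id": 33772771, "product_id": 87817232, "predicted_qty": 10, "unit_price": 50},
]
budget = 500.0
plan = allocate_on_top_discounts(products, budget)
total_spend = sum(item["expected_spend"] for item in plan)

for item in plan:
    print(item)
print("expected spend:", total_spend)
```

Because each product's expected spend is at most its share, the total expected spend cannot exceed the budget under these assumptions.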

Here’s a revised version of the Python code that uses polars and parallel processing to optimize the budget distribution and calculate product_on_top_discount for each vendor-product combination:

import polars as pl
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load historical order detail data (polars parses the CSV on several threads)
order_detail_df = pl.read_csv('order_detail.csv', n_threads=4)

# Load input list
input_list_df = pl.read_csv('input_list.csv', n_threads=4)

# Keep only the historical rows for vendor-product pairs in the input list
data_df = order_detail_df.join(input_list_df, on=['vendor_id', 'product_id'], how='inner')

# Select relevant columns
data_df = data_df.select(['vendor_id', 'product_id', 'product_quantity', 'product_total_discount', 'product_on_top_discount'])

# Split data into training and testing sets (as NumPy arrays for scikit-learn)
X = data_df.select(['vendor_id', 'product_id', 'product_total_discount']).to_numpy()
y = data_df['product_quantity'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train linear regression model and check it on the held-out set
model = LinearRegression()
model.fit(X_train, y_train)
print(f"R^2 on test set: {model.score(X_test, y_test):.3f}")

# Get the budget limit from user
budget_limit = float(input("Enter budget limit: "))

# Calculate predicted sales for each vendor-product combination
# (vendor_discount takes the place of product_total_discount, in the same column order)
features = input_list_df.select(['vendor_id', 'product_id', 'vendor_discount']).to_numpy()
input_list_df = input_list_df.with_columns(pl.Series('predicted_sales', model.predict(features)))

# Normalize predicted sales by dividing by total predicted sales across all products
input_list_df = input_list_df.with_columns(
    (pl.col('predicted_sales') / pl.col('predicted_sales').sum()).alias('normalized_sales'))

# Calculate additional discount based on budget limit and normalized sales
# (a rough heuristic: it mixes a currency budget with discount fractions)
input_list_df = input_list_df.with_columns(
    (budget_limit * (pl.col('normalized_sales') - pl.col('vendor_discount'))).alias('additional_discount'))

# Calculate product_on_top_discount: capped at the vendor discount, never negative
input_list_df = input_list_df.with_columns(
    pl.when(pl.col('additional_discount') < 0).then(pl.lit(0.0))
    .when(pl.col('additional_discount') < pl.col('vendor_discount')).then(pl.col('additional_discount'))
    .otherwise(pl.col('vendor_discount'))
    .alias('product_on_top_discount'))

# Save optimized budget and discounts in a new CSV file
input_list_df.select(['vendor_id', 'product_id', 'product_on_top_discount']).write_csv('optimized_budget_and_discounts.csv')

In this revision, we have replaced pandas with polars, which is a fast and efficient DataFrame library for Python that supports parallel processing.

By setting the n_threads parameter to 4 in pl.read_csv(), we have instructed polars to use four threads to load the data in parallel. We have also used train_test_split() function from sklearn to split the data into training and testing sets.

Next, we have trained the linear regression model using LinearRegression() function from sklearn. After getting the budget limit from the user, we have calculated predicted sales for each vendor-product combination using the trained model.

Then, we have normalized the predicted sales by dividing by the total predicted sales across all products. We have calculated the additional discount based on the budget limit and normalized sales, and set product_on_top_discount to the smaller of the additional discount and the vendor discount for each vendor-product combination.

Finally, we have saved the resulting DataFrame to a new CSV file.

Note that polars parallelizes operations such as CSV parsing and joins across CPU cores, which often makes it faster than pandas on large datasets. The model training itself is still done by scikit-learn.