Supplementary Material

JDS

Journal of Data Science

ISSN: 1683-8602, 1680-743X

School of Statistics, Renmin University of China

JDS1198

10.6339/25-JDS1198

Data Science in Action

Label-efficient Response Modelling: Cost-Effective Marketing Using Cluster-Based Active Sampling

Swee Chuan Tan (ORCID: https://orcid.org/0009-0007-7739-6743)

School of Business, Singapore University of Social Sciences, Singapore

Email: jamestansc@suss.edu.sg

2025

3 September 2025

00114

Supplementary Material

The Python notebook containing the implementation of the proposed method is available at the following link: https://colab.research.google.com/drive/1IG-9N7iakfPUUKnIskKYH2kbs_F6sdgP?usp=sharing.

Additionally, the datasets used in this study, where features are ordered by their importance (from left to right), can be accessed via: https://drive.google.com/drive/folders/1WE8A0aZ-cKLJ45hRDMFH20CczwZ2wiWh?usp=sharing.

16 April 2025; 18 July 2025

© 2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

This paper introduces a label-efficient response modelling method for settings in which the target labels are unknown a priori. Most response modelling methods adopt a supervised or semi-supervised approach; we instead apply clustering to partition the data into homogeneous segments, which are assumed to reflect the underlying response behaviours. We then draw a random sample from each cluster and acquire the true target label of each sampled record. This cluster-based stratified sampling reduces the cost of the label acquisition needed to estimate the cluster-specific response rates and the overall basic response rate. The goal is to identify a subset of the population that is more likely to respond (e.g., make a purchase) while controlling campaign costs. Subsetting the population in this way departs from conventional classification tasks, which require labels for all observations. We regard clusters whose response rates are significantly higher than the estimated basic response rate as high-propensity clusters and acquire all of their remaining labels. Our experimental results show that the response rates of high-propensity clusters are at least 1.7 times the basic response rate. This suggests that the proposed approach significantly reduces costs by targeting only high-propensity groups, and that it is useful in scenarios lacking historical ground truth.

Keywords: active learning; data-efficient learning; imbalanced data; predictive modelling; semi-supervised learning; stratified sampling
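The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the notebook implementation linked in the Supplementary Material. It assumes that cluster assignments already exist (for example from k-means), that a callable `oracle` stands in for costly label acquisition, and that "significantly higher" is judged with a one-sided z-test at level `alpha`. The function names, the sampling fraction `frac`, the minimum sample size `min_n`, and the choice of test are all illustrative assumptions and are not taken from the paper.

```python
import math
import random
from statistics import NormalDist


def stratified_response_rates(cluster_ids, oracle, frac=0.1, min_n=30, seed=0):
    """Sample each cluster, query labels, and estimate response rates.

    cluster_ids: one cluster label per record (clustering assumed already done).
    oracle: callable index -> 0/1, a stand-in for paid label acquisition.
    Returns (per-cluster stats, size-weighted overall "basic" response rate).
    """
    rng = random.Random(seed)
    members = {}
    for idx, c in enumerate(cluster_ids):
        members.setdefault(c, []).append(idx)

    stats = {}
    for c, idxs in members.items():
        # Assumed rule: sample a fraction of the cluster, at least min_n records.
        n = min(len(idxs), max(min_n, math.ceil(frac * len(idxs))))
        sampled = rng.sample(idxs, n)
        responders = sum(oracle(i) for i in sampled)
        stats[c] = {"size": len(idxs), "n": n, "rate": responders / n}

    # Stratified estimate: weight each cluster rate by its population share.
    total = len(cluster_ids)
    overall = sum(s["size"] / total * s["rate"] for s in stats.values())
    return stats, overall


def high_propensity_clusters(stats, overall, alpha=0.05):
    """Flag clusters whose sampled rate exceeds the overall rate.

    Uses a one-sided z-test for a proportion (an assumption of this sketch).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    flagged = []
    for c, s in stats.items():
        se = math.sqrt(overall * (1 - overall) / s["n"])
        if se > 0 and (s["rate"] - overall) / se > z_crit:
            flagged.append(c)
    return sorted(flagged)


if __name__ == "__main__":
    # Synthetic population with hypothetical true response rates per cluster.
    rng = random.Random(42)
    true_rates = {0: 0.05, 1: 0.10, 2: 0.40}
    cluster_ids = [0] * 5000 + [1] * 3000 + [2] * 2000
    labels = [int(rng.random() < true_rates[c]) for c in cluster_ids]
    stats, overall = stratified_response_rates(cluster_ids, labels.__getitem__)
    print("estimated basic response rate:", round(overall, 3))
    print("high-propensity clusters:", high_propensity_clusters(stats, overall))
```

In this sketch only the sampled records are labelled at first. The remaining labels would then be acquired only for the flagged clusters, which is where the saving in labelling cost arises.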
