Label-efficient Response Modelling: Cost-Effective Marketing Using Cluster-Based Active Sampling

Tan, Swee Chuan

doi:10.6339/25-JDS1198

Journal of Data Science

Pub. online: 3 September 2025
Type: Data Science In Action

Open Access

Received: 16 April 2025
Accepted: 18 July 2025
Published: 3 September 2025

Abstract

This paper introduces a label-efficient response modelling method for settings in which the target labels are unknown a priori. Unlike most response modelling methods, which adopt a supervised or semi-supervised approach, we apply clustering to partition the data into homogeneous segments, which are assumed to reflect the underlying response behaviours. We then take a random sample from each cluster and acquire the true target label of each sampled record. This cluster-based stratified sampling approach reduces the cost of the label acquisition needed to estimate the cluster-specific and overall basic response rates. The goal is to identify a subset of the population that is more likely to respond (e.g., make a purchase) while controlling campaign costs. Subsetting the population in this way departs from conventional classification tasks, which require full labelling of all observations. We regard clusters whose response rates are significantly higher than the estimated basic response rate as high-propensity clusters and proceed to acquire all their remaining labels. Our experimental results show that the response rates of high-propensity clusters are at least 1.7 times the basic response rate. This suggests that the proposed approach significantly reduces costs by targeting only high-propensity groups and is useful in scenarios lacking historical ground truth.
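The sampling and selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation (that is in the supplementary notebook). It assumes the cluster assignments have already been computed by some clustering algorithm, and the names `find_high_propensity_clusters`, `acquire_label`, `sample_frac`, and `z_crit` are hypothetical. The abstract does not say which significance test is used, so the one-sided z-test for a proportion below is an assumption, as is the toy data.

```python
import math
import random
from collections import defaultdict


def find_high_propensity_clusters(cluster_ids, acquire_label,
                                  sample_frac=0.1, z_crit=1.645, seed=0):
    """Estimate response rates from a per-cluster random sample of labels.

    cluster_ids   : precomputed cluster id for each record (assumed given)
    acquire_label : hypothetical oracle, record index -> 0/1 response
    Returns (basic_rate, cluster_rates, high_propensity_ids, n_labels_bought).
    """
    rng = random.Random(seed)

    # Group record indices by cluster.
    members = defaultdict(list)
    for idx, c in enumerate(cluster_ids):
        members[c].append(idx)

    n_total = len(cluster_ids)
    rates, n_sampled = {}, {}
    for c, idxs in members.items():
        # Stratified step: sample within the cluster, then acquire labels.
        n_s = max(1, round(sample_frac * len(idxs)))
        sampled = rng.sample(idxs, n_s)
        rates[c] = sum(acquire_label(i) for i in sampled) / n_s
        n_sampled[c] = n_s

    # Basic response rate: cluster rates weighted by cluster share.
    basic = sum(len(members[c]) / n_total * rates[c] for c in members)

    # Assumed test: one-sided z-test of each cluster rate against the basic rate.
    high = []
    for c in members:
        se = math.sqrt(basic * (1 - basic) / n_sampled[c])
        if se > 0 and (rates[c] - basic) / se > z_crit:
            high.append(c)
    return basic, rates, sorted(high), sum(n_sampled.values())


def toy_label(i):
    """Toy ground truth: clusters 0, 1, 2 respond at 5%, 10%, 40%."""
    c = i // 1000
    if c == 0:
        return int(i % 20 == 0)
    if c == 1:
        return int(i % 10 == 0)
    return int(i % 5 < 2)


# Three toy clusters of 1000 records each.
cluster_ids = [i // 1000 for i in range(3000)]
basic, rates, high, n_labelled = find_high_propensity_clusters(cluster_ids, toy_label)
print("labels acquired:", n_labelled, "of", len(cluster_ids))
print("estimated basic response rate:", round(basic, 3))
print("high-propensity clusters:", high)
```

In this toy run only 10% of the records are labelled before the high-propensity cluster is chosen. The remaining labels would then be acquired only for the clusters returned in `high`.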

Supplementary material

The Python notebook containing the implementation of the proposed method is available at the following link: https://drive.google.com/drive/folders/1WE8A0aZ-cKLJ45hRDMFH20CczwZ2wiWh?usp=sharing.

© 2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

active learning; data-efficient learning; imbalanced data; predictive modelling; semi-supervised learning; stratified sampling
