Label-efficient Response Modelling: Cost-Effective Marketing Using Cluster-Based Active Sampling
Pub. online: 3 September 2025
Type: Data Science In Action
Open Access
Received
16 April 2025
Accepted
18 July 2025
Published
3 September 2025
Abstract
This paper introduces a label-efficient response modelling method for settings in which the target labels are unknown a priori. Unlike most response modelling methods, which adopt a supervised or semi-supervised approach, we apply clustering to partition the data into homogeneous segments that are assumed to reflect the underlying response behaviours. We then take a random sample from each cluster and acquire the true target label for each sampled record. This cluster-based stratified sampling approach reduces the cost of the label acquisition needed to estimate the cluster-specific and overall basic response rates. The goal is to identify a subset of the population that is more likely to respond (e.g., make a purchase) while controlling campaign costs. Subsetting the population in this way departs from conventional classification tasks, which require full labelling of all observations. We regard clusters whose response rates are significantly higher than the estimated basic response rate as high-propensity clusters and proceed to acquire all of their remaining labels. Our experimental results show that the response rates of high-propensity clusters are at least 1.7 times the basic response rate. This suggests that the proposed approach significantly reduces costs by targeting only high-propensity groups and is useful in scenarios that lack historical ground truth.
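The sampling and selection steps described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the implementation from the authors' notebook. It assumes cluster assignments already exist (from any clustering algorithm), and the function name `find_high_propensity_clusters`, the `acquire_label` callback, the 20% sampling fraction, and the one-sided z-test at `alpha=0.05` are all assumptions introduced here, since the abstract does not specify how "significantly higher" is tested.

```python
import math
import random
from collections import defaultdict
from statistics import NormalDist


def find_high_propensity_clusters(cluster_ids, acquire_label,
                                  sample_frac=0.2, alpha=0.05, seed=0):
    """Estimate response rates from a per-cluster random sample of labels.

    cluster_ids   : cluster assignment of each record (assumed precomputed)
    acquire_label : hypothetical callback returning the 0/1 label of a record;
                    each call stands for one paid label acquisition
    """
    rng = random.Random(seed)

    # Group record indices by cluster.
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    # Draw a simple random sample within each cluster and label only those records.
    rates, sample_sizes = {}, {}
    for cid, idxs in members.items():
        n = max(1, round(sample_frac * len(idxs)))
        sampled = rng.sample(idxs, n)
        rates[cid] = sum(acquire_label(i) for i in sampled) / n
        sample_sizes[cid] = n

    # Stratified estimate of the overall (basic) response rate:
    # cluster rates weighted by each cluster's share of the population.
    total = len(cluster_ids)
    base_rate = sum(len(members[c]) / total * rates[c] for c in members)

    # Assumed criterion: one-sided z-test of each cluster rate against the basic rate.
    z_crit = NormalDist().inv_cdf(1 - alpha)
    high = []
    for cid in members:
        se = math.sqrt(base_rate * (1 - base_rate) / sample_sizes[cid])
        if se > 0 and (rates[cid] - base_rate) / se > z_crit:
            high.append(cid)
    return rates, base_rate, high


# Synthetic data: three clusters of 1000 records each; cluster 0 responds at 30%,
# clusters 1 and 2 at 5%. In practice the labels would be unknown until acquired.
cluster_ids = [i // 1000 for i in range(3000)]
true_labels = [
    1 if (i < 1000 and i % 10 < 3) or (i >= 1000 and i % 20 == 0) else 0
    for i in range(3000)
]
rates, base_rate, high = find_high_propensity_clusters(
    cluster_ids, true_labels.__getitem__
)
print(rates, base_rate, high)
```

In this sketch only 600 of the 3000 labels are acquired up front. The remaining labels would then be acquired only for the clusters returned in `high`, which is where the cost saving described above would come from.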
Supplementary material
The Python notebook containing the implementation of the proposed method is available at the following link: https://drive.google.com/drive/folders/1WE8A0aZ-cKLJ45hRDMFH20CczwZ2wiWh?usp=sharing.