
An Optimization-Based Order-and-Cut Approach for Fair Clustering of Data Sets

Author

Listed:
  • Su Li

    (Department of Industrial and Systems Engineering, Texas A&M University, College Station, Texas 77843)

  • Hrayer Aprahamian

    (Department of Industrial and Systems Engineering, Texas A&M University, College Station, Texas 77843)

  • Maher Nouiehed

    (Department of Industrial Engineering and Management, American University of Beirut, Beirut 1107 2020, Lebanon)

  • Hadi El-Amine

    (Department of Systems Engineering and Operations Research, George Mason University, Fairfax, Virginia 22030)

Abstract

Machine learning algorithms have been increasingly integrated into applications that significantly affect human lives. This has spurred interest in designing algorithms that train machine learning models to minimize training error while imposing a certain level of fairness. In this paper, we consider the problem of fair clustering of data sets. In particular, given a set of items, each associated with a vector of nonsensitive attribute values and a categorical sensitive attribute (e.g., gender or race), our goal is to find a clustering of the items that minimizes the loss (i.e., clustering objective) function and imposes fairness as measured by Rényi correlation. We propose an efficient and scalable in-processing algorithm, driven by findings from the field of combinatorial optimization, that heuristically solves the underlying optimization problem and allows for regulating the trade-off between clustering quality and fairness. The approach does not restrict the analysis to a specific loss function; instead, it considers a more general form that satisfies certain desirable properties, which broadens the scope of the algorithm's applicability. We demonstrate the effectiveness of the algorithm for the specific case of k-means clustering, as it is one of the most extensively studied and widely adopted clustering schemes. Our numerical experiments reveal that the proposed algorithm significantly outperforms existing methods by providing a more effective mechanism to regulate the trade-off between loss and fairness.
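The two quantities traded off in the abstract can be illustrated with a small sketch. This is not the authors' order-and-cut algorithm; it only evaluates a given clustering. It assumes the k-means loss is the within-cluster sum of squared distances, and it computes the Rényi (maximal) correlation between cluster labels and a categorical sensitive attribute using the standard result for discrete variables: the second-largest singular value of the normalized joint distribution matrix. The function names `kmeans_loss` and `renyi_correlation` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_loss(X, labels):
    """Within-cluster sum of squared distances to each cluster centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).ravel()
    loss = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        loss += ((pts - pts.mean(axis=0)) ** 2).sum()
    return float(loss)

def renyi_correlation(labels, sensitive):
    """Maximal correlation between two categorical variables.

    Builds Q[i, j] = P(i, j) / sqrt(P(i) * P(j)) from empirical frequencies;
    the largest singular value of Q is always 1, the second-largest is the
    Renyi correlation (0 means independent, 1 means fully dependent).
    """
    _, a = np.unique(np.asarray(labels).ravel(), return_inverse=True)
    _, b = np.unique(np.asarray(sensitive).ravel(), return_inverse=True)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)          # contingency counts
    joint /= joint.sum()                   # empirical joint distribution
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    Q = joint / np.sqrt(np.outer(pa, pb))
    s = np.linalg.svd(Q, compute_uv=False)
    return float(s[1]) if s.size > 1 else 0.0

# Toy data (hypothetical): four items, one binary sensitive attribute.
X = [[0, 0], [0, 2], [10, 0], [10, 2]]
sensitive = ["a", "a", "b", "b"]
tight = [0, 0, 1, 1]   # low loss, but clusters coincide with the groups
mixed = [0, 1, 0, 1]   # higher loss, each cluster balanced across groups
for name, lab in (("tight", tight), ("mixed", mixed)):
    print(name, kmeans_loss(X, lab), renyi_correlation(lab, sensitive))
```

On this toy data the lower-loss clustering is the less fair one, which is the kind of trade-off the abstract says the algorithm is designed to regulate.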

Suggested Citation

  • Su Li & Hrayer Aprahamian & Maher Nouiehed & Hadi El-Amine, 2024. "An Optimization-Based Order-and-Cut Approach for Fair Clustering of Data Sets," INFORMS Journal on Data Science, INFORMS, vol. 3(2), pages 124-144, October.
  • Handle: RePEc:inm:orijds:v:3:y:2024:i:2:p:124-144
    DOI: 10.1287/ijds.2022.0005

    Download full text from publisher

    File URL: http://dx.doi.org/10.1287/ijds.2022.0005
    Download Restriction: no


    References listed on IDEAS

    1. Hadi El-Amine & Hrayer Aprahamian, 2022. "A heuristic scheme for multivariate set partitioning problems with application to classifying heterogeneous populations for multiple binary attributes," IISE Transactions, Taylor & Francis Journals, vol. 54(6), pages 537-549, June.
    2. Shmuel Onn & Leonard J. Schulman, 2001. "The Vector Partition Problem for Convex Objective Functions," Mathematics of Operations Research, INFORMS, vol. 26(3), pages 583-590, August.
    3. Hrayer Aprahamian & Douglas R. Bish & Ebru K. Bish, 2019. "Optimal Risk-Based Group Testing," Management Science, INFORMS, vol. 65(9), pages 4365-4384, September.
    4. A. K. Chakravarty & J. B. Orlin & U. G. Rothblum, 1982. "Technical Note—A Partitioning Problem with Additive Objective with an Application to Optimal Inventory Groupings for Joint Replenishment," Operations Research, INFORMS, vol. 30(5), pages 1018-1022, October.
    5. A. K. Chakravarty & J. B. Orlin & U. G. Rothblum, 1985. "Consecutive Optimizers for a Partitioning Problem with Applications to Optimal Inventory Groupings for Joint Replenishment," Operations Research, INFORMS, vol. 33(4), pages 820-834, August.
    6. Shmuel Gal & Boris Klots, 1995. "Optimal Partitioning Which Maximizes the Sum of the Weighted Averages," Operations Research, INFORMS, vol. 43(3), pages 500-508, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Frank K. Hwang & Uriel G. Rothblum, 2011. "On the number of separable partitions," Journal of Combinatorial Optimization, Springer, vol. 21(4), pages 423-433, May.
    2. Chung‐Lun Li & Zhi‐Long Chen, 2006. "Bin‐packing problem with concave costs of bin utilization," Naval Research Logistics (NRL), John Wiley & Sons, vol. 53(4), pages 298-308, June.
    3. Gerard J. Chang & Fu-Loong Chen & Lingling Huang & Frank K. Hwang & Su-Tzu Nuan & Uriel G. Rothblum & I-Fan Sun & Jan-Wen Wang & Hong-Gwa Yeh, 1998. "Sortabilities of Partition Properties," Journal of Combinatorial Optimization, Springer, vol. 2(4), pages 413-427, December.
    4. Hrayer Aprahamian & Hadi El-Amine, 2022. "Optimal Screening of Populations with Heterogeneous Risk Profiles Under the Availability of Multiple Tests," INFORMS Journal on Computing, INFORMS, vol. 34(1), pages 150-164, January.
    5. Hussein El Hajj & Douglas R. Bish & Ebru K. Bish & Denise M. Kay, 2022. "Novel Pooling Strategies for Genetic Testing, with Application to Newborn Screening," Management Science, INFORMS, vol. 68(11), pages 7994-8014, November.
    6. Jia Shu & Chung-Piaw Teo & Zuo-Jun Max Shen, 2005. "Stochastic Transportation-Inventory Network Design Problem," Operations Research, INFORMS, vol. 53(1), pages 48-60, February.
    7. Huilan Chang & Frank K. Hwang & Uriel G. Rothblum, 2012. "A new approach to solve open-partition problems," Journal of Combinatorial Optimization, Springer, vol. 23(1), pages 61-78, January.
    8. Thomas Mariotti & Nikolaus Schweizer & Nora Szech & Jonas von Wangenheim, 2023. "Information Nudges and Self-Control," Management Science, INFORMS, vol. 69(4), pages 2182-2197, April.
    9. Frank K. Hwang & Shmuel Onn & Uriel G. Rothblum, 2000. "Explicit solution of partitioning problems over a 1‐dimensional parameter space," Naval Research Logistics (NRL), John Wiley & Sons, vol. 47(6), pages 531-540, September.
    10. Yann Braouezec, 2013. "The Welfare Effects of Regulating the Number of Market Segments," Working Papers 2013-ECO-11, IESEG School of Management.
    11. Braouezec, Yann, 2016. "On the welfare effects of regulating the number of discriminatory prices," Research in Economics, Elsevier, vol. 70(4), pages 588-607.
    12. Yaakov Malinovsky, 2019. "Sterrett Procedure for the Generalized Group Testing Problem," Methodology and Computing in Applied Probability, Springer, vol. 21(3), pages 829-840, September.
    13. Sheikh-Zadeh, Alireza & Rossetti, Manuel D. & Scott, Marc A., 2021. "Performance-based inventory classification methods for large-Scale multi-echelon replenishment systems," Omega, Elsevier, vol. 101(C).
    14. Siddhartha Syam & Bala Shetty, 1998. "Coordinated replenishments from multiple suppliers with price discounts," Naval Research Logistics (NRL), John Wiley & Sons, vol. 45(6), pages 579-598, September.
    15. Ogawa, Sanae & Ohta, Hiroshi, 1995. "Common order cycle system for multi-item inventory model with learning in ordering and transportation," International Journal of Production Economics, Elsevier, vol. 41(1-3), pages 321-325, October.
    16. Hrayer Aprahamian & Douglas R. Bish & Ebru K. Bish, 2020. "Optimal Group Testing: Structural Properties and Robust Solutions, with Application to Public Health Screening," INFORMS Journal on Computing, INFORMS, vol. 32(4), pages 895-911, October.
    17. Borgwardt, S. & Brieden, A. & Gritzmann, P., 2017. "An LP-based k-means algorithm for balancing weighted point sets," European Journal of Operational Research, Elsevier, vol. 263(2), pages 349-355.
    18. Amiya K. Chakravarty & G. E. Martin, 1989. "Discount pricing policies for inventories subject to declining demand," Naval Research Logistics (NRL), John Wiley & Sons, vol. 36(1), pages 89-102, February.
    19. Arbib, Claudio & Rossi, Fabrizio, 2000. "An optimization problem arising in the design of multiring systems," European Journal of Operational Research, Elsevier, vol. 124(1), pages 63-76, July.
    20. Wildeman, R. E. & Dekker, R. & Smit, A. C. J. M., 1997. "A dynamic policy for grouping maintenance activities," European Journal of Operational Research, Elsevier, vol. 99(3), pages 530-551, June.
