
Rare Feature Selection in High Dimensions

Author

Listed:
  • Xiaohan Yan
  • Jacob Bien

Abstract

It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such “rare features” has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers. Supplementary materials for this article are available online.
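
The aggregation idea in the abstract can be illustrated with a small sketch. This is not the paper's estimator and not the API of the rare package. It only shows, on made-up data, how summing sparse count columns that share a tree node yields denser features. The word list, the count matrix, the two-node tree, and the helper names `aggregate_by_node` and `density` are all hypothetical.

```python
import numpy as np

# Hypothetical toy data: 6 reviews, 4 rarely occurring words (count columns).
words = ["superb", "excellent", "dreadful", "awful"]
X = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 2, 0, 0],
    [0, 0, 0, 0],
])

# Hypothetical similarity tree, flattened to: internal node -> leaf column indices.
tree_nodes = {
    "positive": [0, 1],
    "negative": [2, 3],
}


def aggregate_by_node(X, node_to_leaves):
    """Sum the count columns under each tree node into one denser column."""
    names = list(node_to_leaves)
    Z = np.column_stack([X[:, node_to_leaves[n]].sum(axis=1) for n in names])
    return names, Z


def density(M):
    """Fraction of nonzero entries in each column."""
    return (M != 0).mean(axis=0)


names, Z = aggregate_by_node(X, tree_nodes)
print("leaf densities:      ", density(X))
print("aggregated densities:", dict(zip(names, density(Z))))
```

In this toy case each aggregated column is nonzero in more rows than any single word column beneath it. The method described in the abstract goes further by choosing the level of aggregation in a data-driven way, which this fixed grouping does not attempt.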

Suggested Citation

  • Xiaohan Yan & Jacob Bien, 2021. "Rare Feature Selection in High Dimensions," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 116(534), pages 887-900, April.
  • Handle: RePEc:taf:jnlasa:v:116:y:2021:i:534:p:887-900
    DOI: 10.1080/01621459.2020.1796677

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1080/01621459.2020.1796677
    Download Restriction: Access to full text is restricted to subscribers.


    Citations

    Cited by:

    1. Gen Li & Yan Li & Kun Chen, 2023. "It's all relative: Regression analysis with compositional predictors," Biometrics, The International Biometric Society, vol. 79(2), pages 1318-1329, June.
    2. Aaron J. Molstad & Keshav Motwani, 2023. "Multiresolution categorical regression for interpretable cell‐type annotation," Biometrics, The International Biometric Society, vol. 79(4), pages 3485-3496, December.
    3. Bingkai Wang & Brian S. Caffo & Xi Luo & Chin‐Fu Liu & Andreia V. Faria & Michael I. Miller & Yi Zhao & for the Alzheimer's Disease Neuroimaging Initiative*, 2022. "Regularized regression on compositional trees with application to MRI analysis," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 71(3), pages 541-561, June.
    4. Yi Zhao & Bingkai Wang & Chin‐Fu Liu & Andreia V. Faria & Michael I. Miller & Brian S. Caffo & Xi Luo, 2023. "Identifying brain hierarchical structures associated with Alzheimer's disease using a regularized regression method with tree predictors," Biometrics, The International Biometric Society, vol. 79(3), pages 2333-2345, September.
