Printed from https://ideas.repec.org/a/taf/japsta/v47y2020i13-15p2312-2327.html

Ordered quantile normalization: a semiparametric transformation built for the cross-validation era

Author

Listed:
  • Ryan A. Peterson
  • Joseph E. Cavanaugh

Abstract

Normalization transformations have recently experienced a resurgence in popularity in the era of machine learning, particularly in data preprocessing. However, the classical methods that can be adapted to cross-validation are not always effective. We introduce Ordered Quantile (ORQ) normalization, a one-to-one transformation that is designed to consistently and effectively transform a vector of arbitrary distribution into a vector that follows a normal (Gaussian) distribution. In the absence of ties, ORQ normalization is guaranteed to produce normally distributed transformed data. Once trained, an ORQ transformation can be readily and effectively applied to new data. We compare the effectiveness of the ORQ technique with other popular normalization methods in a simulation study where the true data generating distributions are known. We find that ORQ normalization is the only method that works consistently and effectively, regardless of the underlying distribution. We also explore the use of repeated cross-validation to identify the best normalizing transformation when the true underlying distribution is unknown. We apply our technique and other normalization methods via the bestNormalize R package on a car pricing data set. We built bestNormalize to evaluate the normalization efficacy of many candidate transformations; the package is freely available via the Comprehensive R Archive Network.
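The abstract does not state the transformation's formula. As a minimal sketch of the idea, the Python below assumes the rank-based form g(x) = Φ⁻¹((rank(x) − 1/2) / n), where Φ⁻¹ is the standard normal quantile function and n is the training sample size, and it applies a fitted map to new data by linear interpolation between training values. The function names `fit_orq` and `apply_orq` are hypothetical, the clamping of out-of-range values is a simplification chosen here, and none of this should be read as the bestNormalize implementation.

```python
from bisect import bisect_left
from statistics import NormalDist


def fit_orq(x):
    """Fit a rank-based normal-scores map on training values (hypothetical helper).

    Returns the transformed training values and the sorted, unique
    (value, score) pairs needed to transform new data later.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values so they share one average rank.
        while j + 1 < n and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # Assumed form: Phi^{-1}((rank - 1/2) / n); the argument stays inside (0, 1).
    z = [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]
    knots = sorted(set(zip(x, z)))  # tied values share a score, so x is unique
    return z, knots


def apply_orq(knots, new_x):
    """Transform new values by linear interpolation between training knots.

    Values outside the training range are clamped to the nearest end score;
    this is a simplification for the sketch, not a recommendation.
    """
    xs = [k[0] for k in knots]
    zs = [k[1] for k in knots]
    out = []
    for v in new_x:
        if v <= xs[0]:
            out.append(zs[0])
        elif v >= xs[-1]:
            out.append(zs[-1])
        else:
            hi = bisect_left(xs, v)  # first knot with xs[hi] >= v
            lo = hi - 1
            t = (v - xs[lo]) / (xs[hi] - xs[lo])
            out.append(zs[lo] + t * (zs[hi] - zs[lo]))
    return out


# Illustrative, made-up training vector with a heavy right tail.
train = [3.2, 0.1, 7.5, 1.4, 120.0]
z, knots = fit_orq(train)
print(z)
print(apply_orq(knots, [2.3, 50.0]))
```

Because the fitted map depends only on the training ranks, it can be fit on a training fold and then reused unchanged on a held-out fold, which is the property a cross-validation workflow relies on.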

Suggested Citation

  • Ryan A. Peterson & Joseph E. Cavanaugh, 2020. "Ordered quantile normalization: a semiparametric transformation built for the cross-validation era," Journal of Applied Statistics, Taylor & Francis Journals, vol. 47(13-15), pages 2312-2327, November.
  • Handle: RePEc:taf:japsta:v:47:y:2020:i:13-15:p:2312-2327
    DOI: 10.1080/02664763.2019.1630372

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1080/02664763.2019.1630372
    Download Restriction: Access to full text is restricted to subscribers.


    Citations

    Citations are extracted by the CitEc Project.


    Cited by:

    1. Ryan A. Peterson & Joseph E. Cavanaugh, 2022. "Ranked sparsity: a cogent regularization framework for selecting and estimating feature interactions and polynomials," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 106(3), pages 427-454, September.
    2. Takeshi Matsui & Martin N. Mullis & Kevin R. Roy & Joseph J. Hale & Rachel Schell & Sasha F. Levy & Ian M. Ehrenreich, 2022. "The interplay of additivity, dominance, and epistasis on fitness in a diploid yeast cross," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    3. Newhouse, David Locke & Merfeld, Joshua David & Ramakrishnan, Anusha Pudugramam & Swartz, Tom & Lahiri, Partha, 2022. "Small Area Estimation of Monetary Poverty in Mexico Using Satellite Imagery and Machine Learning," Policy Research Working Paper Series 10175, The World Bank.
    4. Maëva Labouyrie & Cristiano Ballabio & Ferran Romero & Panos Panagos & Arwyn Jones & Marc W. Schmid & Vladimir Mikryukov & Olesya Dulya & Leho Tedersoo & Mohammad Bahram & Emanuele Lugato & Marcel G. , 2023. "Patterns in soil microbial diversity across Europe," Nature Communications, Nature, vol. 14(1), pages 1-21, December.
    5. Oguz Turkozan & Vasiliki Almpanidou & Can Yılmaz & Antonios D. Mazaris, 2021. "Extreme thermal conditions in sea turtle nests jeopardize reproductive output," Climatic Change, Springer, vol. 167(3), pages 1-16, August.
    6. Barker, Justin R. & MacIsaac, Hugh J., 2022. "Species distribution models: Administrative boundary centroid occurrences require careful interpretation," Ecological Modelling, Elsevier, vol. 472(C).
    7. Celeste McCracken & Zahra Raisi-Estabragh & Michele Veldsman & Betty Raman & Andrea Dennis & Masud Husain & Thomas E. Nichols & Steffen E. Petersen & Stefan Neubauer, 2022. "Multi-organ imaging demonstrates the heart-brain-liver axis in UK Biobank participants," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
