Statistical Analysis and Modeling of Mass Spectrometry-Based Metabolomics Data

Part of the book series: Methods in Molecular Biology (MIMB, volume 1198)

Abstract

Multivariate statistical techniques are used extensively in metabolomics studies, in tasks ranging from biomarker selection to model building and validation. This chapter discusses two model-independent variable selection techniques, principal component analysis and two-sample t-tests, as well as classification and regression models and model-related variable selection techniques, including partial least squares, logistic regression, support vector machines, and random forests. Model evaluation and validation methods, such as leave-one-out cross-validation, Monte Carlo cross-validation, and receiver operating characteristic analysis, are introduced with an emphasis on avoiding over-fitting the data. The advantages and limitations of these statistical techniques are also discussed.
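
To make the over-fitting concern concrete, the sketch below combines three of the ideas named in the abstract: a two-sample t-test used to rank variables, leave-one-out cross-validation, and the area under the receiver operating characteristic curve. It is an illustration only and not the chapter's own procedure: the nearest-centroid classifier, the synthetic data, the Welch form of the t statistic, and all function names are assumptions made here. The point it tries to show is that variable selection is repeated inside every fold, so the held-out sample never influences which variables are chosen.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic for two lists of measurements."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def auc(scores, labels):
    """ROC AUC: fraction of (case, control) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def loocv_scores(X, y, top_k=2):
    """Leave-one-out CV with t-test variable selection redone in each fold."""
    scores = []
    for i in range(len(X)):
        train = [j for j in range(len(X)) if j != i]

        def abs_t(f):
            # Rank variable f using the training samples only.
            a = [X[j][f] for j in train if y[j] == 1]
            b = [X[j][f] for j in train if y[j] == 0]
            return abs(welch_t(a, b))

        feats = sorted(range(len(X[0])), key=abs_t, reverse=True)[:top_k]
        # Hypothetical classifier: class centroids on the selected variables.
        c1 = [mean(X[j][f] for j in train if y[j] == 1) for f in feats]
        c0 = [mean(X[j][f] for j in train if y[j] == 0) for f in feats]
        x = [X[i][f] for f in feats]
        # Higher score means the held-out sample sits closer to the case centroid.
        scores.append(math.dist(x, c0) - math.dist(x, c1))
    return scores


# Synthetic data (an assumption): 20 samples, 10 variables, and only the
# first two variables differ between cases (1) and controls (0).
random.seed(0)
y = [1] * 10 + [0] * 10
X = [[random.gauss(3.0 if (label == 1 and f < 2) else 0.0, 1.0)
      for f in range(10)] for label in y]

cv_scores = loocv_scores(X, y)
cv_auc = auc(cv_scores, y)
print("Leave-one-out AUC:", round(cv_auc, 3))
```

Selecting variables once on the full data set and cross-validating only the classifier would let the held-out samples shape the model, which tends to give an optimistic AUC.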

Acknowledgments

This article was written while one of the authors, Bowei Xi, was on sabbatical leave at the Statistical and Applied Mathematical Sciences Institute (SAMSI, Research Triangle Park, NC). This work is partially funded by NSF DMS-1228348, ARO W911NF-12-1-0558, DoD MURI W911NF-08-1-0238 (BX) and NIH R01GM085291 (DR).

Author information

Corresponding author

Correspondence to Bowei Xi.

Copyright information

© 2014 Springer Science+Business Media New York

About this protocol

Cite this protocol

Xi, B., Gu, H., Baniasadi, H., Raftery, D. (2014). Statistical Analysis and Modeling of Mass Spectrometry-Based Metabolomics Data. In: Raftery, D. (eds) Mass Spectrometry in Metabolomics. Methods in Molecular Biology, vol 1198. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-1258-2_22

  • DOI: https://doi.org/10.1007/978-1-4939-1258-2_22

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-1257-5

  • Online ISBN: 978-1-4939-1258-2

  • eBook Packages: Springer Protocols
