Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance
  • Authors: Antonio Irpino (1)
    Rosanna Verde (1)

    1. Department of Political Sciences “J. Monnet”, Second University of Naples, Viale Ellittico 31, 81100 Caserta, Italy
  • Keywords: Modal symbolic variables; Probability distribution function; Histogram data; Regression; Wasserstein distance; 62J05; 62G30; 46F10
  • Journal: Advances in Data Analysis and Classification
  • Publication date: March 2015
  • Volume: 9
  • Issue: 1
  • Pages: 81-106
  • Full-text size: 272 KB
  • Journal category: Mathematics and Statistics
  • Journal subjects: Mathematics
    Statistics
    Statistical Theory and Methods
    Statistics for Business, Economics, Mathematical Finance and Insurance
    Statistics for Life Sciences, Medicine and Health Sciences
    Statistics for Engineering, Physics, Computer Science, Chemistry and Geosciences
    Statistics for Social Science, Behavioral Science, Education, Public Policy and Law
  • Publisher: Springer Berlin / Heidelberg
  • ISSN: 1862-5355
Abstract
In this paper we present a new linear regression technique for distributional symbolic variables, i.e., variables whose realizations can be histograms, empirical distributions or empirical estimates of parametric distributions. Such data are known as numerical modal data according to the Symbolic Data Analysis definitions. To measure the error between the observed and the predicted distributions, the \(\ell _2\) Wasserstein distance is proposed. Some properties of this metric are exploited to predict the modal response variable as a linear combination of the explanatory modal variables. Based on the metric, the model uses the quantile functions associated with the data and is thus subject to a positivity constraint on the estimated parameters. We propose solving the linear regression problem by starting from a particular decomposition of the squared distance. We therefore estimate the model parameters through two separate models, one for the averages of the data and one for the centered distributions, using a constrained least squares algorithm. Measures of goodness-of-fit are also proposed and discussed. The method is validated by two applications, one on simulated data and one on two real-world datasets.
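
The decomposition mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each distribution is given as an empirical sample, and that both samples have the same size, so the quantile functions reduce to the sorted values. The function names (`wasserstein2_squared`, `decompose`) and the toy data are hypothetical. The sketch shows that the squared \(\ell _2\) Wasserstein distance splits into a part due to the averages and a part due to the centered distributions.

```python
from statistics import fmean


def wasserstein2_squared(x, y):
    """Squared L2 Wasserstein distance between two empirical distributions.

    Assumption: both samples have the same size, so each empirical quantile
    function is represented by the sorted sample and the integral over
    quantile levels becomes a plain average.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes samples of equal size")
    qx, qy = sorted(x), sorted(y)
    return fmean([(a - b) ** 2 for a, b in zip(qx, qy)])


def decompose(x, y):
    """Split the squared distance into a mean part and a centered part."""
    mx, my = fmean(x), fmean(y)
    mean_part = (mx - my) ** 2
    # Distance between the distributions after subtracting their averages.
    centered_part = wasserstein2_squared(
        [a - mx for a in x], [b - my for b in y]
    )
    return mean_part, centered_part


# Toy data (hypothetical): two small samples standing in for two distributions.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 3.0, 10.0]

total = wasserstein2_squared(x, y)
mean_part, centered_part = decompose(x, y)
print(total, mean_part, centered_part)  # 3.0 1.0 2.0
```

Because the two parts add up to the total, a fit can treat the averages and the centered distributions separately, which matches the two-model strategy described above. The positivity constraint also has a simple reading in this setting: a quantile function is non-decreasing, and a linear combination of non-decreasing functions is guaranteed to stay non-decreasing only when its coefficients are non-negative.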
