Population recovery and partial identification
  • Authors: Avi Wigderson; Amir Yehudayoff
  • Keywords: Noise recovery; Identification; Partial information
  • Journal: Machine Learning
  • Publication date: January 2016
  • Volume: 102
  • Issue: 1
  • Pages: 29–56
  • Full-text size: 564 KB
  • Author affiliations:
    1. Avi Wigderson: Institute for Advanced Study, Princeton, NJ, USA
    2. Amir Yehudayoff: Technion–IIT, Haifa, Israel
  • Journal category: Computer Science
  • Journal subjects: Artificial Intelligence and Robotics
    Automation and Robotics
    Computing Methodologies
    Simulation and Modeling
    Language Translation and Linguistics
  • Publisher: Springer Netherlands
  • ISSN: 1573-0565
Abstract
We study several problems in which an unknown distribution over an unknown population of vectors must be recovered from partial or noisy samples, each of which nearly completely erases or obliterates the original vector. For example, consider a distribution \(p\) over a population \(V \subseteq \{0,1\}^n\). A noisy sample \(v'\) is obtained by choosing \(v\) according to \(p\) and flipping each coordinate of \(v\) independently with probability, say, 0.49. The problem is to recover \(V\) and \(p\) as efficiently as possible from such noisy samples. Such problems arise naturally in a variety of contexts in learning, clustering, statistics, computational biology, data mining, and database privacy, where loss and error may be introduced by nature, by inaccurate measurements, or on purpose. We give fairly efficient algorithms to recover the data under fairly general assumptions. Underlying our algorithms is a new structure we call a partial identification (PID) graph, defined for an arbitrary finite set of vectors over any alphabet. This graph captures the extent to which certain subsets of coordinates in each vector distinguish it from the other vectors. PID graphs yield strategies for dimension reduction and for re-assembly of statistical information, and so may be useful in other applications as well. The quality of our algorithms (sequential and parallel runtime, as well as numerical stability) depends critically on three parameters of PID graphs: width, depth, and cost. The combinatorial heart of this work is showing that every set of vectors possesses a PID graph in which all three parameters are small (we also prove some limitations on their trade-offs). We further give an efficient algorithm that finds such near-optimal PID graphs for any set of vectors. Our efficient PID graphs yield general algorithms for these recovery problems even when the loss or noise is just below the information-theoretic limit.
In the learning/clustering context, this gives a new algorithm for learning mixtures of binomial distributions (with known marginals) whose running time depends only quasi-polynomially on the number of clusters. We also discuss implications for privacy and coding.
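The noisy-sample model described in the abstract can be sketched in a few lines of Python. This is only an illustration of the sampling process, not the paper's recovery algorithm; the function names `noisy_sample` and `draw` are ours, and 0.49 is the example noise rate from the text.

```python
import random

def noisy_sample(v, mu, rng=random):
    """Flip each coordinate of the binary vector v independently
    with probability mu, as in the noisy-sample model above."""
    return [bit ^ (rng.random() < mu) for bit in v]

def draw(population, p, mu, rng=random):
    """Choose v according to the distribution p over the population,
    then return a noisy copy of it."""
    v = rng.choices(population, weights=p, k=1)[0]
    return noisy_sample(v, mu, rng)

# Example: two hidden vectors, noise rate 0.49, i.e. just below the
# information-theoretic limit of 0.5 mentioned in the abstract.
V = [[0, 1, 1, 0], [1, 1, 0, 0]]
p = [0.7, 0.3]
sample = draw(V, p, 0.49, random.Random(0))
```

At noise rate 0.5 each coordinate of a sample is a fair coin flip and carries no information about \(v\); the point of the paper is that recovery remains feasible even when the rate is only slightly below that limit.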
