chi^2 testing of Zipf generator questions Bradley Lucier (07 Feb 2024 02:26 UTC)
I've been trying to use a chi^2 test for the Zipf generators, using the parameter sets of the tests in zipf-test.scm. The limitation is that I'm running tests only for NVOCAB (the number of bins) > 90, so that I can use the normal approximation to the chi^2 probability distribution.

Of those tests with NVOCAB > 90, only the following fail:

(#f 130 4.5 0 30000 106.2896889047007 48.18713521262703)
(#f 130 6.66 0 30000 126.1122250147938 48.18713521262703)
(#f 131 16.1 41.483 10000 92.25243717046817 48.373546489791295)
(#f 131 46.1 41.483 10000 108.9481906236824 48.373546489791295)
(#f 131 96.1 41.483 10000 129.19043641049112 48.373546489791295)

The first entry of each list is whether the test passes or fails (so these are all failures), followed by:

number of bins
s
q
number of samples
(abs (- chi^2 mean))
(* 3 (sqrt variance))

Here chi^2 is the computed chi^2 statistic, and mean and variance are the mean and variance of the chi^2 distribution. A test passes when the second-to-last number is less than the last number, so the failing tests above are not even close.

Some other tests were close to failure:

(#t 43701 1.000000001 0 60000 876.0176218421475 886.904729945669)
(#t 43701 1.000000000001 0 60000 864.1726089265794 886.904729945669)

I guess that's because these have barely more samples than bins.

My question: Is there something about these parameters that is likely to lead to such failures?

Brad
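For readers who want to reproduce the pass/fail criterion described above, here is a minimal Python sketch. It is not the code from zipf-test.scm; the function names `chi2_statistic` and `chi2_normal_check` are hypothetical, and it assumes the chi^2 distribution has NVOCAB - 1 degrees of freedom (mean NVOCAB - 1, variance 2 * (NVOCAB - 1)), with no parameters estimated from the samples.

```python
import math


def chi2_statistic(observed, expected):
    """Pearson chi^2 statistic: sum over bins of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_normal_check(chi2, nbins, sigmas=3.0):
    """Apply the 3-sigma criterion from the post.

    Assumption: degrees of freedom = nbins - 1, so the chi^2
    distribution has mean dof and variance 2 * dof.
    Returns (passed, deviation, threshold).
    """
    dof = nbins - 1
    mean = dof
    variance = 2 * dof
    deviation = abs(chi2 - mean)               # (abs (- chi^2 mean))
    threshold = sigmas * math.sqrt(variance)   # (* 3 (sqrt variance))
    return deviation < threshold, deviation, threshold


if __name__ == "__main__":
    # A chi^2 value chosen so the deviation equals the first reported failure.
    passed, deviation, threshold = chi2_normal_check(129 + 106.2896889047007, 130)
    print(passed, deviation, threshold)
```

Under that assumption, the threshold for 130 bins is 3 * sqrt(258), which agrees with the last number in the first two failing lists, so the reported thresholds appear consistent with NVOCAB - 1 degrees of freedom.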