Rev. Acad. Canar. Cienc., IX (Núm. 1), 97-124 (1997)
AN ESTIMATOR FOR THE NUMBER OF CLUSTERS IN A POPULATION
JUAN JOSÉ PRIETO MARTÍNEZ
University Carlos III of Madrid. Department of Statistics and Econometrics. Calle Madrid, 126
28093 - Getafe - MADRID. SPAIN
Abstract:
Assume that a random sample is drawn from a population with an
unknown number K of clusters. This work proposes a nonparametric
method to estimate the number of clusters when most of the
information is concentrated in the low-order occupancy numbers.
The paper derives an estimator of K and proves its asymptotic
distribution using a method of Holst (1979). The performance of
the estimator is investigated by means of Monte Carlo experiments,
and it is applied to one real data example.
Key words: number of clusters, method of Holst, asymptotic normality.
© Del documento, de los autores. Digitalización realizada por ULPGC. Biblioteca Universitaria, 2017
1./ Introduction.
Assume that there is an unknown number K of different clusters in
a population. We search this population by selecting one member at
a time, noting its class identity and returning it to the
population. Suppose n selections have been made and $p_j$ denotes the
probability that a randomly selected member belongs to the jth
cluster, $j=1,\dots,K$, $\sum_{j} p_j = 1$. If $p_j = 1/K$ for all
$j=1,\dots,K$ (the equally likely or equiprobable assumption), the
problem reduces to an inference problem involving only one parameter.
See, for example, Lewontin & Prout (1956), Darroch (1958), McNeil
(1973), Johnson & Kotz (1977), Harris (1968), Holst (1981) and
Marchand & Schroeck (1982).
In most practical applications, the equally likely assumption is
probably not valid. Examples are the insects in a forest
classified by species, the words in a computer file classified by
precise letter sequence, or archaeological artifacts classified
by type. Most authors adopted a parametric approach to handle
heterogeneous populations (i.e. unequal cluster probabilities).
For example, Fisher, Corbet and Williams (1943) assumed that for
each species the number of elements observed in the sample follows
a Poisson distribution and that the Poisson parameter has a
gamma-type distribution. Many other papers on stochastic
abundance models also make parametric assumptions; see, for example,
Engen (1978) for a review.
The sample coverage of a random sample from a multinomial
population is defined to be the sum of the probabilities of the
observed clusters. For an equiprobable population, the estimator
proposed by Darroch and Ratcliff (1980) used exactly the idea of
sample coverage. For heterogeneous populations, Esty (1985) was
the first to apply the concept of sample coverage to estimate the
number of clusters in a parametric setup. The clusters discussed
by Esty are the different dies in minting. He assumed that the
number of coins that each die produced follows a negative binomial
distribution and obtained an estimator of the number of dies in
terms of the sample coverage and the parameter of the negative
binomial distribution. A nonparametric estimation technique was
proposed by Chao (1992) to estimate the number of clusters using
the idea of sample coverage. She generalizes the result of Esty
(1985) to a nonparametric approach and extends Darroch and
Ratcliff (1980) to incorporate the heterogeneity of the cluster
probabilities.
The previous authors do not prove the asymptotic distribution of
the proposed estimators.
This work proposes a nonparametric method to estimate the
number of clusters when most of the information is concentrated on
$(D, N_1, N_2)$, where D is the total number of clusters observed in
the sample and $N_i$ is the number of clusters observed exactly i
times in the sample. See Section 2.
In Section 3, the asymptotic normality of the estimator is
proved by applying a result of Holst (1979). In Section 4, the
results of a simulation study investigating the performance of
the estimator are shown, and the estimator is also applied to one
real data example.
2./ An estimator for K.
Let $C_0, C_1, C_2, \dots, C_K$ be the clusters in a population. Suppose n
selections have been made and $p_j$ denotes the probability that a
randomly selected member belongs to the jth cluster, $j=0,1,\dots,K$;
$\sum_{j=0}^{K} p_j = 1$.
Let $c_{ij} = I$(the ith observation belongs to the jth cluster), where
$I(A)$ is the indicator function, $i=1,\dots,n$; $j=0,1,\dots,K$. Let
$X_j = \sum_{i=1}^{n} c_{ij}$ be the number of observations belonging to
the jth cluster; then $(X_0, X_1, \dots, X_K)$ follows a multinomial
distribution with parameters $(n; p_0, p_1, \dots, p_K)$.
Let
\[ N_i = \sum_{j=1}^{K} I(X_j = i), \qquad i=0,1,\dots,n, \]
be the number of clusters with i representatives in the sample, and let
\[ D = \sum_{j=1}^{K} I(X_j > 0) \]
be the number of observed distinct clusters.
Define
\[ Z_{j,i} = \begin{cases} 1 & \text{if the cluster } j \text{ is observed } i \text{ times in the sample,} \\ 0 & \text{otherwise.} \end{cases} \]
Hence,
\[ E(N_i) = E\Big(\sum_{j=1}^{K} Z_{j,i}\Big) = \sum_{j=1}^{K} \mathrm{Prob}(Z_{j,i}=1) = \sum_{j=1}^{K} \binom{n}{i} p_j^{i} (1-p_j)^{n-i}, \qquad \forall\, i=0,1,\dots,n. \]
In particular,
\[ E(N_0) = \sum_{j=1}^{K} (1-p_j)^{n}, \tag{2.1} \]
\[ E(N_1) = \sum_{j=1}^{K} n p_j (1-p_j)^{n-1} \tag{2.2} \]
and
\[ E(N_2) = \sum_{j=1}^{K} \binom{n}{2} p_j^{2} (1-p_j)^{n-2}. \tag{2.3} \]
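The expectations (2.1)-(2.3) are easy to evaluate numerically for a known population. The following is a minimal sketch, not part of the original paper; the function name `expected_occupancy` and the example population (200 equiprobable clusters, 50 draws) are illustrative assumptions.

```python
from math import comb

def expected_occupancy(p, n, i):
    """E(N_i) = sum_j C(n, i) p_j^i (1 - p_j)^(n - i) for cluster probabilities p."""
    return sum(comb(n, i) * pj**i * (1.0 - pj) ** (n - i) for pj in p)

# Hypothetical example: K = 200 equiprobable clusters, n = 50 draws.
K, n = 200, 50
p = [1.0 / K] * K
e_n0 = expected_occupancy(p, n, 0)
e_n1 = expected_occupancy(p, n, 1)
e_n2 = expected_occupancy(p, n, 2)
# Summing E(N_i) over all i counts every cluster once, so the total is K.
total = sum(expected_occupancy(p, n, i) for i in range(n + 1))
print(e_n0, e_n1, e_n2, total)
```

Two sanity checks follow from the definitions: the $E(N_i)$ sum to K, and $\sum_i i\,E(N_i) = n$.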
It follows from the Cauchy-Schwarz inequality that:
\[
\begin{aligned}
\Big(\sum_{j=1}^{K} p_j (1-p_j)^{n-1}\Big)^{2}
&= \Big(\sum_{j=1}^{K} \big[p_j (1-p_j)^{n/2-1}\big]\big[(1-p_j)^{n/2}\big]\Big)^{2} \\
&\le \sum_{j=1}^{K} \big(p_j (1-p_j)^{n/2-1}\big)^{2} \sum_{j=1}^{K} \big((1-p_j)^{n/2}\big)^{2}
= \sum_{j=1}^{K} p_j^{2}(1-p_j)^{n-2} \sum_{j=1}^{K} (1-p_j)^{n}.
\end{aligned}
\]
That is,
\[ \Big(\sum_{j=1}^{K} p_j (1-p_j)^{n-1}\Big)^{2} \le \sum_{j=1}^{K} p_j^{2}(1-p_j)^{n-2} \sum_{j=1}^{K} (1-p_j)^{n}, \]
which is equivalent to:
\[ \Big(\sum_{j=1}^{K} p_j (1-p_j)^{n-1}\Big)^{2} \Big(\sum_{j=1}^{K} p_j^{2}(1-p_j)^{n-2}\Big)^{-1} \le \sum_{j=1}^{K} (1-p_j)^{n}, \]
that is:
\[ \sum_{j=1}^{K} (1-p_j)^{n} \ge \frac{\Big(\sum_{j=1}^{K} n p_j (1-p_j)^{n-1}\Big)^{2} \big/ n^{2}}{\sum_{j=1}^{K} n(n-1) p_j^{2} (1-p_j)^{n-2} \big/ \big(n(n-1)\big)}. \]
Combining (2.1), (2.2) and (2.3),
\[ E(N_0) \ge \frac{n-1}{n}\,\frac{(E(N_1))^{2}}{2E(N_2)}. \]
Note that $K = N_0 + D$ (the number of clusters in the population is equal
to the number of clusters with 0 representatives in the sample plus
the number of clusters observed), so that $K = E(N_0) + E(D)$. Thus a
lower bound of K is:
\[ K \ge E(D) + \frac{n-1}{n}\,\frac{E(N_1)^{2}}{2E(N_2)}. \]
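The lower bound can be checked numerically for a known population. The sketch below is an illustration under stated assumptions, not the author's code: the helper names `expected_counts` and `lower_bound` are hypothetical, and the two populations are the equiprobable one and a two-group one with probabilities 0.004 and 0.006. For equal probabilities the Cauchy-Schwarz step holds with equality, so the bound equals K.

```python
from math import comb

def expected_counts(p, n):
    """Return (E(D), E(N_0), E(N_1), E(N_2)) under multinomial sampling."""
    e_n = [sum(comb(n, i) * pj**i * (1 - pj) ** (n - i) for pj in p) for i in range(3)]
    e_d = len(p) - e_n[0]  # E(D) = K - E(N_0)
    return e_d, e_n[0], e_n[1], e_n[2]

def lower_bound(p, n):
    """E(D) + (n - 1) / n * E(N_1)^2 / (2 E(N_2)), the lower bound for K."""
    e_d, _, e_n1, e_n2 = expected_counts(p, n)
    return e_d + (n - 1) / n * e_n1**2 / (2 * e_n2)

n = 50
equal = [1 / 200] * 200                    # equiprobable population, K = 200
unequal = [0.004] * 100 + [0.006] * 100    # heterogeneous population, K = 200
print(lower_bound(equal, n), lower_bound(unequal, n))
```

For the heterogeneous population the bound falls strictly below 200, which is the sense in which the resulting estimator is conservative.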
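Replacing $E(D)$, $E(N_1)$ and $E(N_2)$ by the observed values $d$, $n_1$
and $n_2$, an estimator if $n_2 > 0$ is:
\[ \hat K = d + \frac{n-1}{n}\,\frac{n_1^{2}}{2 n_2}. \]
Note that if $n \to \infty$,
\[ \hat K \approx d + \frac{n_1^{2}}{2 n_2}. \]
3./ A normal limit law.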
A limit distribution is rigorously proved for $\hat K$ using a method of
Holst (1979). It is shown that the characteristic function of
$K^{-1/2}(\hat K - K)/\sigma$ converges to that of a standard normal
distribution, where $\sigma^2$ is specified in the proof.
Assume the following hypotheses:
1. $\sum_{j=1}^{K} K^{-1} n^{1/2} p_j \to 0$.
2. $\sum_{j=1}^{K} K^{-1/2} n p_j^{2} = n^{-1} K^{-1/2} \sum_{j=1}^{K} (np_j)^{2} \to 0$.
4. $\sum_{j=1}^{K} O(K^{-3/2}) \to 0$.
Let:
\[
\begin{aligned}
\hat K - K ={}& \sum_{j=1}^{K} \Big(-\big(I(X_j=0) - (1-p_j)^{n}\big)\Big)
+ \frac{n-1}{n}\,\frac{E(N_1)}{E(N_2)} \sum_{j=1}^{K} \big(I(X_j=1) - np_j(1-p_j)^{n-1}\big) \\
&- \frac{n-1}{n}\,\frac{E^{2}(N_1)}{E^{2}(N_2)} \sum_{j=1}^{K} \Big(I(X_j=2) - \binom{n}{2} p_j^{2}(1-p_j)^{n-2}\Big),
\end{aligned}
\tag{3.1}
\]
where $I(A)$ is the usual indicator function, using a Taylor series
expansion. Taylor series:
\[ f(x_1, x_2, \dots, x_n) = f(x_1^{0}, x_2^{0}, \dots, x_n^{0}) + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}(x_1^{0}, x_2^{0}, \dots, x_n^{0})\,(x_i - x_i^{0}) + R. \]
Now regard $\hat K$ as a function of $I(X_j=0)$, $I(X_j=1)$ and $I(X_j=2)$,
$j=1,2,\dots,K$. Expand $\hat K$ at the point:
\[ \Big[(1-p_j)^{n},\; np_j(1-p_j)^{n-1},\; \binom{n}{2} p_j^{2}(1-p_j)^{n-2}\Big]. \]
Now the problem is to find the asymptotic distribution of $\sum_{j=1}^{K} f(X_j)$
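and show that the characteristic function $\psi(s)$ of
$K^{-1/2}\sum_{j=1}^{K} f(X_j)$ converges to that of a normal
distribution, where
\[
\begin{aligned}
f(X_j) ={}& -\big(I(X_j=0) - (1-p_j)^{n}\big) + \frac{n-1}{n}\,\frac{E(N_1)}{E(N_2)}\big(I(X_j=1) - np_j(1-p_j)^{n-1}\big) \\
&- \frac{n-1}{n}\,\frac{E^{2}(N_1)}{E^{2}(N_2)}\Big(I(X_j=2) - \binom{n}{2} p_j^{2}(1-p_j)^{n-2}\Big).
\end{aligned}
\]
First note that:
\[ P(X_0=x_0, X_1=x_1, \dots, X_K=x_K) = P\Big(Y_0=x_0, Y_1=x_1, \dots, Y_K=x_K \,\Big|\, \sum_{j=0}^{K} Y_j = n\Big), \]
where $\{Y_j\}$ are independent Poisson random variables with mean $np_j$.
Its proof is:
\[
P\Big(Y_0=x_0, \dots, Y_K=x_K \,\Big|\, \sum_{j=0}^{K} Y_j = n\Big)
= \frac{P\big(Y_0=x_0, \dots, Y_K=x_K, \sum_{j=0}^{K} Y_j = n\big)}{P\big(\sum_{j=0}^{K} Y_j = n\big)}
= \frac{\prod_{j=0}^{K} \big[(np_j)^{x_j}/x_j!\big] e^{-np_j}}{\big[n^{n}/n!\big] e^{-n}}.
\]
Hence,
\[ P\Big(Y_0=x_0, \dots, Y_K=x_K \,\Big|\, \sum_{j=0}^{K} Y_j = n\Big) = \frac{n!}{x_0! \cdots x_K!} \prod_{j=0}^{K} p_j^{x_j}, \]
which is the multinomial probability.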
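The identity between the multinomial law and independent Poisson variables conditioned on their sum can be verified numerically on a small case. This is a minimal sketch for illustration only; the function names and the three-cluster example are assumptions, not taken from the paper.

```python
from math import exp, factorial, prod

def multinomial_pmf(x, p):
    """P(X_0 = x_0, ..., X_K = x_K) for a multinomial with n = sum(x)."""
    n = sum(x)
    coef = factorial(n) / prod(factorial(xi) for xi in x)
    return coef * prod(pi**xi for pi, xi in zip(p, x))

def poisson_pmf(k, mean):
    return exp(-mean) * mean**k / factorial(k)

def conditioned_poisson_pmf(x, p):
    """P(Y_0 = x_0, ..., Y_K = x_K | sum Y_j = n), with Y_j ~ Poisson(n p_j) independent."""
    n = sum(x)
    joint = prod(poisson_pmf(xi, n * pi) for xi, pi in zip(x, p))
    return joint / poisson_pmf(n, n)  # the sum of the Y_j is Poisson with mean n

p = [0.5, 0.3, 0.2]   # hypothetical cluster probabilities
x = [2, 1, 3]         # hypothetical counts, n = 6
print(multinomial_pmf(x, p), conditioned_poisson_pmf(x, p))
```

Both calls return the same probability up to floating-point rounding.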
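It then follows from Lemma 2 of Holst (1979),
"Let (U, V) be a two dimensional random vector with U integer
valued. Then
\[ E\big(e^{ivV} \mid U=n\big) = \frac{1}{2\pi P(U=n)} \int_{-\pi}^{\pi} E\big(e^{iu(U-n)+ivV}\big)\,du\text{''}, \]
that:
\[ \psi(s) = \frac{1}{2\pi P\big(\sum_{j=0}^{K} Y_j = n\big)} \int_{-\pi}^{\pi} E\Big(\exp\Big\{iu\sum_{j=0}^{K}(Y_j - np_j) + isK^{-1/2}\sum_{j=1}^{K} f(Y_j)\Big\}\Big)\,du. \]
Since
\[ E\Big(\sum_{j=0}^{K} Y_j\Big) = \sum_{j=0}^{K} E(Y_j) = \sum_{j=0}^{K} np_j = n \quad\text{and}\quad n! \approx e^{-n}\sqrt{2\pi n}\,n^{n}, \]
\[ P\Big(\sum_{j=0}^{K} Y_j = n\Big) = \frac{e^{-n} n^{n}}{n!} \approx \frac{e^{-n} n^{n}}{e^{-n}\sqrt{2\pi n}\,n^{n}} = \frac{1}{\sqrt{2\pi n}}. \]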
Let $t = u\sqrt{n}$ to obtain
\[
\begin{aligned}
\psi(s) &= \frac{1}{2\pi\,\frac{1}{\sqrt{2\pi n}}} \int_{-\pi n^{1/2}}^{\pi n^{1/2}} E\Big(\exp\Big\{itn^{-1/2}\sum_{j=0}^{K}(Y_j - np_j) + isK^{-1/2}\sum_{j=1}^{K} f(Y_j)\Big\}\Big)\, n^{-1/2}\,dt \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\pi n^{1/2}}^{\pi n^{1/2}} E\Big(\exp\Big\{itn^{-1/2}\sum_{j=0}^{K}(Y_j - np_j) + isK^{-1/2}\sum_{j=1}^{K} f(Y_j)\Big\}\Big)\,dt.
\end{aligned}
\]
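Define:
\[ H_n(s) = \frac{1}{\sqrt{2\pi}} \int_{-\pi n^{1/2}}^{\pi n^{1/2}} h_{1n}(s,t)\,h_{2n}(t)\,dt, \]
where
\[ h_{1n}(s,t) = \prod_{j=1}^{K} E\Big(\exp\Big\{itn^{-1/2}(Y_j - np_j) + isK^{-1/2} f(Y_j)\Big\}\Big) \]
and
\[ h_{2n}(t) = E\Big(\exp\Big\{itn^{-1/2}(Y_0 - np_0)\Big\}\Big). \]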
Note that $Y_0$, for given n, is a Poisson random variable with mean $np_0$, so
\[
\begin{aligned}
h_{2n}(t) &= E\big(e^{itn^{-1/2}(Y_0 - np_0)}\big) = E\big(e^{itn^{-1/2}Y_0}\big)\,e^{-itn^{-1/2}np_0} \\
&= e^{-itn^{-1/2}np_0} \sum_{j=0}^{\infty} e^{itn^{-1/2}j}\,e^{-np_0}\,\frac{(np_0)^{j}}{j!}
 = e^{-itn^{1/2}p_0}\,e^{-np_0} \sum_{j=0}^{\infty} \frac{\big(e^{itn^{-1/2}}np_0\big)^{j}}{j!} \\
&= e^{-itn^{1/2}p_0} \exp\big\{np_0\big(e^{itn^{-1/2}} - 1\big)\big\} \\
&= e^{-itn^{1/2}p_0} \exp\Big\{np_0\Big(\Big[1 + \frac{it}{n^{1/2}} - \frac{t^{2}}{2n} + O(n^{-3/2}t^{3})\Big] - 1\Big)\Big\} \\
&= \exp\Big\{-itn^{1/2}p_0 + itn^{1/2}p_0 - np_0\frac{t^{2}}{2n} + O(n^{-1/2}p_0t^{3})\Big\}
 = \exp\Big\{-\frac{t^{2}p_0}{2} + O(n^{-1/2}p_0t^{3})\Big\}.
\end{aligned}
\]
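The approximation of $h_{2n}(t)$ by $e^{-t^2p_0/2}$ can be illustrated numerically using the closed form of the Poisson characteristic function. The sketch below is only an illustration; the function names and the values $t=1$, $p_0=0.3$ are arbitrary assumptions.

```python
import cmath
import math

def h2n_exact(t, n, p0):
    """Exact E exp{i t n^(-1/2) (Y0 - n p0)} for Y0 ~ Poisson(n p0)."""
    u = t / math.sqrt(n)
    return cmath.exp(-1j * u * n * p0 + n * p0 * (cmath.exp(1j * u) - 1))

def h2n_limit(t, p0):
    """Leading-order approximation exp(-t^2 p0 / 2)."""
    return math.exp(-(t**2) * p0 / 2)

t, p0 = 1.0, 0.3
for n in (10**2, 10**4, 10**6):
    # The gap shrinks at the rate n^(-1/2), matching the remainder term.
    print(n, abs(h2n_exact(t, n, p0) - h2n_limit(t, p0)))
```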
Consider the factors of $h_{1n}(s,t)$:
\[ h_{1n}(s,t) = \prod_{j=1}^{K} g_j(s,t), \]
where $g_j(s,t) = E\big(\exp\{it(Y_j - np_j)n^{-1/2} + isK^{-1/2} f(Y_j)\}\big)$.
Now,
\[
\begin{aligned}
E\big(e^{it(Y_j - np_j)n^{-1/2} + isK^{-1/2} f(Y_j)}\big)
={}& E\Big(\exp\Big\{it(Y_j - np_j)n^{-1/2} + isK^{-1/2}\big(A_1 I(Y_j=0) + A_2 I(Y_j=1) + A_3 I(Y_j=2)\big)\Big\}\Big) && \text{(A)} \\
&\times \exp\Big\{-isK^{-1/2}\Big(A_1(1-p_j)^{n} + A_2\,np_j(1-p_j)^{n-1} + A_3\binom{n}{2}p_j^{2}(1-p_j)^{n-2}\Big)\Big\}, && \text{(B)}
\end{aligned}
\]
where
\[ A_1 = -1, \qquad A_2 = \frac{n-1}{n}\,\frac{E(N_1)}{E(N_2)} \qquad\text{and}\qquad A_3 = -\frac{n-1}{n}\,\frac{E^{2}(N_1)}{E^{2}(N_2)}. \]
Part A involves the expectation:
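\[
\begin{aligned}
& e^{-itn^{1/2}p_j + isK^{-1/2}A_1}\,e^{-np_j} + e^{it(1-np_j)n^{-1/2} + isK^{-1/2}A_2}\,np_j\,e^{-np_j} \\
&\quad + e^{it(2-np_j)n^{-1/2} + isK^{-1/2}A_3}\,\frac{(np_j)^{2}}{2}\,e^{-np_j}
 + \sum_{R=3}^{\infty} e^{itn^{-1/2}(R-np_j)}\,\frac{(np_j)^{R}}{R!}\,e^{-np_j}.
\end{aligned}
\]
Since $\{Y_j\}$ are independent Poisson random variables with mean $np_j$,
this is equal to
\[
\begin{aligned}
& e^{-np_j}e^{-itn^{1/2}p_j} \Big\{\big[e^{isK^{-1/2}A_1} - 1\big] + np_j\,e^{itn^{-1/2}}\big[e^{isK^{-1/2}A_2} - 1\big] + \frac{(np_j)^{2}}{2}\,e^{2itn^{-1/2}}\big[e^{isK^{-1/2}A_3} - 1\big]\Big\} && \text{(C)} \\
&\quad + e^{-itn^{1/2}p_j} \exp\big\{np_j\big(e^{itn^{-1/2}} - 1\big)\big\}. && \text{(D)}
\end{aligned}
\]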
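Note that (D) is:
\[
\begin{aligned}
e^{-itn^{1/2}p_j} \exp\big\{np_j\big(e^{itn^{-1/2}} - 1\big)\big\}
&= e^{-itn^{1/2}p_j} \exp\Big\{np_j\Big(itn^{-1/2} - \frac{t^{2}}{2n} + O(n^{-3/2}t^{3})\Big)\Big\} \\
&= \exp\Big\{-\frac{p_jt^{2}}{2} + O(n^{-1/2}p_jt^{3})\Big\}
 = e^{-p_jt^{2}/2}\big(1 + O(n^{-1/2}t^{3}p_j)\big).
\end{aligned}
\]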
The expression (C) is:
\[ e^{-np_j}e^{-itn^{1/2}p_j} \Big\{\big(e^{isK^{-1/2}A_1} - 1\big) + np_j\,e^{itn^{-1/2}}\big(e^{isK^{-1/2}A_2} - 1\big) + \frac{(np_j)^{2}}{2}\,e^{2itn^{-1/2}}\big(e^{isK^{-1/2}A_3} - 1\big)\Big\}. \]
Now, expanding:
\[
\begin{aligned}
& e^{-itn^{1/2}p_j} \Big\{\big(e^{isK^{-1/2}A_1} - 1\big) + np_j\,e^{itn^{-1/2}}\big(e^{isK^{-1/2}A_2} - 1\big) + \frac{(np_j)^{2}}{2}\,e^{2itn^{-1/2}}\big(e^{isK^{-1/2}A_3} - 1\big)\Big\} \\
&= \Big(1 - itn^{1/2}p_j - \frac{np_j^{2}t^{2}}{2} + O(n^{3/2}p_j^{3})\Big) \Big\{\Big(isK^{-1/2}A_1 - \frac{s^{2}A_1^{2}}{2K} + O(K^{-3/2})\Big) \\
&\qquad + np_j\Big[1 + \frac{it}{n^{1/2}} + O(n^{-1})\Big]\Big(isK^{-1/2}A_2 - \frac{s^{2}A_2^{2}}{2K} + O(K^{-3/2})\Big) \\
&\qquad + \frac{(np_j)^{2}}{2}\Big[1 + \frac{2it}{n^{1/2}} + O(n^{-1})\Big]\Big(isK^{-1/2}A_3 - \frac{s^{2}A_3^{2}}{2K} + O(K^{-3/2})\Big)\Big\}.
\end{aligned}
\]
Under assumption (3.1), this expression is equal to:
\[
\begin{aligned}
& isK^{-1/2}\Big(A_1 + np_jA_2 + \frac{(np_j)^{2}}{2}A_3\Big) - \frac{s^{2}}{2K}\Big(A_1^{2} + np_jA_2^{2} + \frac{(np_j)^{2}}{2}A_3^{2}\Big) \\
&\quad + \frac{st}{n^{1/2}K^{1/2}}\Big(np_jA_1 + np_jA_2(np_j - 1) + \frac{(np_j)^{2}}{2}(np_j - 2)A_3\Big) + O(K^{-1}).
\end{aligned}
\]
Hence:
\[
\begin{aligned}
\prod_{j=1}^{K} g_j(s,t) ={}& \prod_{j=1}^{K} e^{-p_jt^{2}/2} \Big\{\big[1 + O(n^{-1/2}t^{3}p_j)\big] + \big[1 + O(p_jt^{2})\big]\,e^{-np_j} \Big[isK^{-1/2}\Big(A_1 + np_jA_2 + \frac{(np_j)^{2}}{2}A_3\Big) \\
&\quad - \frac{s^{2}}{2K}\Big(A_1^{2} + np_jA_2^{2} + \frac{(np_j)^{2}}{2}A_3^{2}\Big) + \frac{ts}{n^{1/2}K^{1/2}}\Big(np_jA_1 + np_jA_2(np_j - 1) + \frac{(np_j)^{2}}{2}(np_j - 2)A_3\Big) + O(K^{-1})\Big]\Big\} \\
&\times \Big\{1 - isK^{-1/2}\Big[A_1e^{-np_j} + A_2np_je^{-np_j} + A_3\frac{(np_j)^{2}}{2}e^{-np_j}\Big] - \frac{s^{2}}{2K}\Big[A_1e^{-np_j} + A_2np_je^{-np_j} + A_3\frac{(np_j)^{2}}{2}e^{-np_j}\Big]^{2}\Big\} \\
={}& \prod_{j=1}^{K} e^{-p_jt^{2}/2} \prod_{j=1}^{K} \Big\{1 - \frac{s^{2}}{2K}\Big\{e^{-np_j}\Big(A_1^{2} + A_2^{2}np_j + A_3^{2}\frac{(np_j)^{2}}{2}\Big) - \Big[A_1e^{-np_j} + A_2e^{-np_j}np_j + A_3\frac{(np_j)^{2}}{2}e^{-np_j}\Big]^{2}\Big\} \\
&\quad + \frac{ts}{n^{1/2}K^{1/2}}\,e^{-np_j}\Big(A_1np_j + A_2np_j(np_j - 1) + A_3\frac{(np_j)^{2}}{2}(np_j - 2)\Big) + O(K^{-1})\Big\}.
\end{aligned}
\]
Then,
\[ \prod_{j=1}^{K} g_j(s,t) = e^{-(1-p_0)t^{2}/2} \exp\Big\{-\frac{s^{2}}{2K}\sum_{j=1}^{K} \alpha(j) + \frac{ts}{n^{1/2}K^{1/2}}\sum_{j=1}^{K} \beta(j)\Big\}, \]
and
\[ \prod_{j=1}^{K} g_j(s,t)\,h_{2n}(t) = e^{-(1-p_0)t^{2}/2}\,e^{-p_0t^{2}/2} \times \exp\Big\{\frac{ts}{n^{1/2}K^{1/2}}\sum_{j=1}^{K} \beta(j) - \frac{s^{2}}{2K}\sum_{j=1}^{K} \alpha(j)\Big\}. \]
Then,
\[
\begin{aligned}
H_n(s) &= \frac{1}{\sqrt{2\pi}} \int_{-\pi n^{1/2}}^{\pi n^{1/2}} e^{-t^{2}/2} \exp\Big\{\frac{ts}{(nK)^{1/2}}\sum_{j=1}^{K} \beta(j) - \frac{s^{2}}{2K}\sum_{j=1}^{K} \alpha(j)\Big\}\,dt \\
&= \exp\Big\{\frac{s^{2}}{2nK}\Big(\sum_{j=1}^{K} \beta(j)\Big)^{2} - \frac{s^{2}}{2K}\sum_{j=1}^{K} \alpha(j)\Big\} \times \frac{1}{\sqrt{2\pi}} \int_{-\pi n^{1/2}}^{\pi n^{1/2}} \exp\Big\{-\frac{1}{2}\Big(t - \frac{s\sum_{j=1}^{K} \beta(j)}{(nK)^{1/2}}\Big)^{2}\Big\}\,dt,
\end{aligned}
\]
where
\[ \alpha(j) = e^{-np_j}\Big(A_1^{2} + A_2^{2}np_j + A_3^{2}\frac{(np_j)^{2}}{2}\Big) - e^{-2np_j}\Big[A_1 + A_2np_j + A_3\frac{(np_j)^{2}}{2}\Big]^{2} \]
and
\[ \beta(j) = e^{-np_j}\Big(A_1np_j + A_2np_j(np_j - 1) + A_3\frac{(np_j)^{2}}{2}(np_j - 2)\Big). \]
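If $n, K \to \infty$, the limit of $H_n(s)$ is:
\[ \lim_{n,K\to\infty} H_n(s) = \lim_{n,K\to\infty} \exp\Big\{\frac{s^{2}}{2nK}\Big(\sum_{j=1}^{K} \beta(j)\Big)^{2} - \frac{s^{2}}{2K}\sum_{j=1}^{K} \alpha(j)\Big\} \times \frac{1}{\sqrt{2\pi}} \int_{-\pi n^{1/2}}^{\pi n^{1/2}} \exp\Big\{-\frac{1}{2}\Big(t - \frac{s\sum_{j=1}^{K} \beta(j)}{(nK)^{1/2}}\Big)^{2}\Big\}\,dt. \]
It follows from the dominated convergence theorem that:
\[ \lim_{n,K\to\infty} \frac{1}{\sqrt{2\pi}} \int_{-\pi n^{1/2}}^{\pi n^{1/2}} \exp\Big\{-\frac{1}{2}\Big(t - \frac{s\sum_{j=1}^{K} \beta(j)}{(nK)^{1/2}}\Big)^{2}\Big\}\,dt = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-t^{2}/2}\,dt = 1. \]
Then,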
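\[
\lim_{n,K\to\infty} H_n(s) = \lim_{n,K\to\infty} \exp\Big\{\frac{s^{2}}{2nK}\Big(\sum_{j=1}^{K} \beta(j)\Big)^{2} - \frac{s^{2}}{2K}\sum_{j=1}^{K} \alpha(j)\Big\}
= \exp\Big\{-\frac{s^{2}}{2} \lim_{n,K\to\infty}\Big[-\frac{1}{nK}\Big(\sum_{j=1}^{K} \beta(j)\Big)^{2} + \frac{1}{K}\sum_{j=1}^{K} \alpha(j)\Big]\Big\}.
\]
Hence $K^{-1/2}(\hat K - K)$ converges to a normal distribution with mean 0
and variance:
\[ \sigma^{2} = \lim_{K,n\to\infty} \Big\{\frac{1}{K}\sum_{j=1}^{K} \alpha(j) - \frac{1}{nK}\Big(\sum_{j=1}^{K} \beta(j)\Big)^{2}\Big\}, \]
and $(\hat K - K)$ is asymptotically normal with mean 0
and variance $\sigma^{2}(\hat K) = K\sigma^{2}$.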
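For a known population, the quantity $K\sigma^2 = \sum_j \alpha(j) - (\sum_j \beta(j))^2/n$ can be evaluated at finite n by plugging in the exact $E(N_1)$ and $E(N_2)$. The following is a rough sketch under stated assumptions: the function name `approx_variance` is hypothetical, the cluster probabilities are assumed to sum to one over $j=1,\dots,K$, and the coefficient $A_3$ is taken exactly as it is printed in the text.

```python
from math import comb, exp

def approx_variance(p, n):
    """Finite-n evaluation of sum_j alpha(j) - (sum_j beta(j))^2 / n."""
    e_n1 = sum(n * pj * (1 - pj) ** (n - 1) for pj in p)
    e_n2 = sum(comb(n, 2) * pj**2 * (1 - pj) ** (n - 2) for pj in p)
    a1 = -1.0
    a2 = (n - 1) / n * e_n1 / e_n2
    a3 = -(n - 1) / n * (e_n1 / e_n2) ** 2  # coefficient as printed in the text
    alpha_sum = 0.0
    beta_sum = 0.0
    for pj in p:
        x = n * pj
        weighted_mean = exp(-x) * (a1 + a2 * x + a3 * x**2 / 2)
        alpha_sum += exp(-x) * (a1**2 + a2**2 * x + a3**2 * x**2 / 2) - weighted_mean**2
        beta_sum += exp(-x) * (a1 * x + a2 * x * (x - 1) + a3 * x**2 / 2 * (x - 2))
    return alpha_sum - beta_sum**2 / n

# Hypothetical populations with K = 200 clusters.
print(approx_variance([0.005] * 200, 100))
print(approx_variance([0.004] * 100 + [0.006] * 100, 100))
```

Each $\alpha(j)$ is a variance and each $\beta(j)$ a covariance with $Y_j$ (up to sign), so by the Cauchy-Schwarz inequality the returned value is non-negative whenever the probabilities sum to one.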
Now, assume that the empirical distribution
\[ G_n(x) = \frac{1}{K}\sum_{j=1}^{K} I(np_j \le x) \]
of $np_1, np_2, \dots, np_K$ converges to a probability distribution $G(x)$ on
$(0, \infty)$. Then
\[ \lim_{K,n\to\infty} \Big\{\frac{1}{K}\sum_{j=1}^{K} \alpha(j)\Big\} = \int_0^{\infty} \Big\{e^{-x}\Big(A_1^{2} + A_2^{2}x + A_3^{2}\frac{x^{2}}{2}\Big) - e^{-2x}\Big(A_1 + A_2x + A_3\frac{x^{2}}{2}\Big)^{2}\Big\}\,dG(x), \]
and
\[ \lim_{K,n\to\infty} \Big\{\frac{1}{nK}\Big(\sum_{j=1}^{K} \beta(j)\Big)^{2}\Big\} = \Big\{K\int_0^{\infty} [Kx]\,dG(x)\Big\}^{-1} \Big\{K\int_0^{\infty} \Big[e^{-x}\Big(A_1x + A_2x(x-1) + A_3\frac{x^{2}}{2}(x-2)\Big)\Big]\,dG(x)\Big\}^{2}, \]
because
\[ n = E\Big(\sum_{j=1}^{K} X_j\Big) = E\Big(\sum_{j=1}^{K} Y_j\Big) = \sum_{j=1}^{K} np_j = K\int_0^{\infty} x\,dG(x). \]
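Therefore,
\[
\begin{aligned}
\sigma^{2} ={}& \int_0^{\infty} \Big\{e^{-x}\Big(A_1^{2} + A_2^{2}x + A_3^{2}\frac{x^{2}}{2}\Big) - e^{-2x}\Big(A_1 + A_2x + A_3\frac{x^{2}}{2}\Big)^{2}\Big\}\,dG(x) \\
&- \Big\{K\int_0^{\infty} [Kx]\,dG(x)\Big\}^{-1} \Big\{K\int_0^{\infty} \Big[e^{-x}\Big(A_1x + A_2x(x-1) + A_3\frac{x^{2}}{2}(x-2)\Big)\Big]\,dG(x)\Big\}^{2} \\
={}& \int_0^{\infty} \Big\{e^{-x}\Big(A_1^{2} + A_2^{2}x + A_3^{2}\frac{x^{2}}{2}\Big) - e^{-2x}\Big(A_1^{2} + A_2^{2}x^{2} + A_3^{2}\frac{x^{4}}{4} + 2A_1A_2x + A_1A_3x^{2} + A_2A_3x^{3}\Big)\Big\}\,dG(x) \\
&- \Big\{K\int_0^{\infty} [Kx]\,dG(x)\Big\}^{-1} \Big\{K\int_0^{\infty} \Big[e^{-x}\Big(A_1x + A_2x(x-1) + A_3\frac{x^{2}}{2}(x-2)\Big)\Big]\,dG(x)\Big\}^{2} \\
={}& \int_0^{\infty} \Big\{e^{-x}\Big(A_1^{2} + A_2^{2}x + A_3^{2}\frac{x^{2}}{2}\Big) - e^{-2x}\Big(A_1^{2} + A_2^{2}x^{2} + A_3^{2}\frac{x^{4}}{4}\Big)\Big\}\,dG(x)
 - \int_0^{\infty} \big[xe^{-2x}\big(2A_1A_2 + A_1A_3x + A_2A_3x^{2}\big)\big]\,dG(x) \\
&- \Big\{K\int_0^{\infty} [Kx]\,dG(x)\Big\}^{-1} \Big\{K\int_0^{\infty} \Big[xe^{-x}\Big(A_1 + A_2(x-1) + A_3\frac{x}{2}(x-2)\Big)\Big]\,dG(x)\Big\}^{2}.
\end{aligned}
\]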
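An upper bound of the asymptotic variance of $\hat K$ is
\[
\begin{aligned}
\tilde\sigma^{2}(\hat K) = K\Big(& \cdots + \frac{A_3^{2}}{2}\int_0^{\infty} [x^{2}e^{-x}]\,dG(x) - \int_0^{\infty} \big[xe^{-x}\big(2A_1A_2 + A_1A_3x + A_2A_3x^{2}\big)\big]\,dG(x) \\
&- \Big\{K\int_0^{\infty} [Kx]\,dG(x)\Big\}^{-1} \Big\{K\int_0^{\infty} \Big[xe^{-x}\Big(A_1 + A_2(x-1) + A_3\frac{x}{2}(x-2)\Big)\Big]\,dG(x)\Big\}^{2}\Big).
\end{aligned}
\]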
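Note that:
\[ E(N_i) = K\int_0^{\infty} \big[e^{-x}x^{i}/i!\big]\,dG(x), \]
then $i!\,E(N_i) = K\int_0^{\infty} [e^{-x}x^{i}]\,dG(x)$.
Also
\[ E(D) = E\Big(\sum_{j=1}^{K} I(X_j > 0)\Big) = \sum_{j=1}^{K} \mathrm{Prob}(X_j > 0) = \sum_{j=1}^{K} \big[1 - \mathrm{Prob}(X_j = 0)\big]. \]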
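Using these relations, $\tilde\sigma^{2}(\hat K)$ is equal to:
\[
\begin{aligned}
\tilde\sigma^{2}(\hat K) ={}& E(D) + A_2^{2}E(N_1) + \frac{A_3^{2}}{2}\,2E(N_2) - 2A_1A_2E(N_1) - A_1A_3\,2E(N_2) - A_2A_3\,3!\,E(N_3) \\
&- \frac{\big[A_1E(N_1) + 2A_2E(N_2) - A_2E(N_1) + (A_3/2)\,3!\,E(N_3) - A_3\,2E(N_2)\big]^{2}}{E\big(\sum_{j=1}^{K} Y_j\big)}.
\end{aligned}
\]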
Replacing $E(N_i)$ (with i=1, 2, 3) and $E(D)$ by the observed values,
$\tilde\sigma^{2}(\hat K)$ can be estimated by:
\[
\begin{aligned}
\hat\sigma^{2}(\hat K) ={}& d + \hat A_2^{2}n_1 + \hat A_3^{2}n_2 + 2\hat A_2n_1 + 2\hat A_3n_2 - 6\hat A_2\hat A_3n_3 \\
&- \frac{\big[-n_1 + 2\hat A_2n_2 - \hat A_2n_1 + 3\hat A_3n_3 - 2\hat A_3n_2\big]^{2}}{\sum_{j=1}^{K} Y_j},
\end{aligned}
\]
where $A_1 = -1$ has been substituted and
\[ \hat A_2 = \frac{n-1}{n}\,\frac{n_1}{n_2}, \qquad \hat A_3 = -\frac{n-1}{n}\,\frac{n_1^{2}}{n_2^{2}}. \]
4./ Numerical examples.
Example 1.
A simulation study was carried out to investigate the performance
of the estimator:
\[ \hat K = d + \frac{n-1}{2n}\,\frac{n_1^{2}}{n_2}. \]
The true number of clusters was fixed at 200. Several populations
were considered, with observation probabilities ranging from 0.002
to 0.01. For each given population the program produced 100
simulation runs with sample sizes n=50 and n=100. These 100 values
were averaged to give the results of Table 1. Note that the most
important values in the sample are $N_1$, $N_2$ and $\hat K$.
The following were also calculated:
\[ \mathrm{Bias}(\hat K) = E\big[(\hat K_j - K)\big] = \frac{1}{50}\sum_{j=1}^{50} (\hat K_j - K), \]
\[ \mathrm{ECM}(\hat K) = E\big[(\hat K_j - K)^{2}\big] = \frac{1}{50}\sum_{j=1}^{50} (\hat K_j - K)^{2}, \]
where ECM denotes the mean squared error.
The simulation results indicate that:
- For the population with equal observation probabilities (case 1), the
estimator works very well.
- For any fixed population, the standard error of the estimator
when n=100 is smaller than that when n=50.
- The standard error of the estimator increases as the degree of
heterogeneity of the population increases.
- $\mathrm{Var}(\hat K)$ decreases when n increases.
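A simulation of this kind can be sketched as follows. This is not the program used for Table 1, which is not described in the paper; the function names, the random seed, and the choice to skip runs with $n_2 = 0$ are assumptions of this sketch, so its output is not expected to reproduce the tabulated values.

```python
import random
from collections import Counter

def estimate_k(sample):
    """K-hat = d + (n - 1) / n * n1^2 / (2 * n2); returns None when n2 == 0."""
    n = len(sample)
    occupancy = Counter(Counter(sample).values())  # occupancy[i] = clusters seen i times
    d = sum(occupancy.values())
    n1, n2 = occupancy.get(1, 0), occupancy.get(2, 0)
    if n2 == 0:
        return None
    return d + (n - 1) / n * n1**2 / (2 * n2)

def simulate(p, n, runs, rng):
    """Draw `runs` samples of size n with replacement; return (mean, bias, mse) of K-hat."""
    k_true = len(p)
    estimates = []
    for _ in range(runs):
        sample = rng.choices(range(k_true), weights=p, k=n)
        k_hat = estimate_k(sample)
        if k_hat is not None:  # runs with n2 == 0 are skipped (assumption of this sketch)
            estimates.append(k_hat)
    mean = sum(estimates) / len(estimates)
    mse = sum((e - k_true) ** 2 for e in estimates) / len(estimates)
    return mean, mean - k_true, mse

rng = random.Random(0)
case1 = [0.005] * 200  # equal observation probabilities, K = 200
mean, bias, mse = simulate(case1, n=100, runs=100, rng=rng)
print(mean, bias, mse)
```

Heterogeneous populations can be examined by passing a different probability list, for example `[0.004] * 100 + [0.006] * 100`.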
Table 1.
Cases   n     p_j      j            $\hat K$   $\mathrm{Var}(\hat K)$   $E(\hat K_j - K)$   ECM
1       50    0.005    1-200        203.32     11.74      2.96      20.46
        100                         201.01     2.98       0.87      3.73
2       50    0.004    1-100        199.21     2.30       0.64      2.70
              0.006    101-200
        100                         202.13     6.87       1.87      10.2
3       50    0.0035   1-90         203.41     11.36      2.14      15.57
              0.0045   91-180
              0.014    181-200
        100                         201.52     3.23       1.79      6.43
4       50    0.01     1-10         204.49     18.86      3.96      34.54
              0.004    11-100
              0.003    101-191
              0.023    191-200
        100                         198.21     6.03       2.07      10.28
5       50    0.0035   1-50         195.68     28.97      5.11      56.28
              0.006    51-100
              0.002    101-125
              0.009    126-150
              0.005    151-200
        100                         203.34     11.31      3.14      20.85
6       50    0.006    1-25         212.23     154.85     11.83     292.24
              0.0025   26-50
              0.009    51-75
              0.008    76-100
              0.001    101-125
              0.002    126-150
              0.005    151-175
              0.004    176-200
        100                         190.02     114.48     9.76      208.47
- Note that $\mathrm{ECM} - [E(\hat K_j - K)]^{2}$ is roughly $\mathrm{Var}(\hat K)$.
Example 2.
This interesting example was described in Holst (1981). Given the
number of dies that produced r coins, r=1, 2, ..., in a hoard, the
problem is to estimate the number of dies used in the minting
process. I first discuss the reverse side: 204 coins were found in
a hoard of ancient coins; 156 dies appeared once, 19 twice, 2 three
times, and 1 four times; no die appeared more than four times. For
this frequency sequence, as explained by Holst (1981), it is
plausible to assume that all the classes are equally likely. He
further obtained an estimate of 731 for the number of clusters. The
estimate proposed in this work is $\hat K = 818$.
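The arithmetic for the coin data can be reproduced with a few lines. This is a minimal sketch; the function name `cluster_estimate` and its `finite_n` switch are illustrative assumptions. With these frequencies, d = 178 and n = 204; the large-n form $d + n_1^2/(2n_2)$ gives about 818.4, which agrees with the reported value, while including the factor $(n-1)/n$ gives about 815.3.

```python
def cluster_estimate(freq, finite_n=True):
    """Estimate K from freq[r] = number of clusters seen exactly r times."""
    n = sum(r * count for r, count in freq.items())  # sample size
    d = sum(freq.values())                           # observed distinct clusters
    n1, n2 = freq.get(1, 0), freq.get(2, 0)
    if n2 == 0:
        raise ValueError("the estimator requires n2 > 0")
    factor = (n - 1) / n if finite_n else 1.0
    return d + factor * n1**2 / (2 * n2)

coins = {1: 156, 2: 19, 3: 2, 4: 1}                # reverse-side die frequencies
print(sum(r * c for r, c in coins.items()))        # 204 coins
print(sum(coins.values()))                         # 178 observed dies
print(cluster_estimate(coins))                     # about 815.3
print(cluster_estimate(coins, finite_n=False))     # about 818.4
```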
Acknowledgements.
I want to thank Dr. Anne Chao (Institute of Statistics,
National Tsing Hua University, Hsin-Chu, Taiwan) for her
important contributions to this work.
Bibliography:
Chao, A. (1992). Estimating the number of classes via sample
coverage. Journal of the American Statistical Association, 87,
417, 210-217.
Darroch, J.N. (1958). The multiple recapture census I: Estimation
of a closed population. Biometrika, 40, 343-359.
Darroch, J.N. and Ratcliff, D. (1980). A note on capture-recapture
estimation. Biometrika, 45, 343-359.
Engen, S. (1978). Stochastic Abundance Models. London: Chapman and
Hall.
Esty, W.W. (1985). The estimation of the number of classes in a
population and the coverage of a sample. Mathematical Scientist,
10, 41-50.
Fisher, R.A., Corbet, A.S. and Williams, C.B. (1943). The relation
between the number of species and the number of individuals in a
random sample of an animal population. Journal of Animal Ecology,
12, 42-58.
Harris, B. (1968). Statistical inference in the classical
occupancy problem: unbiased estimation of the number of classes.
Journal of the American Statistical Association, 63, 837-847.
Holst, L. (1979). A unified approach to limit theorems for urn
models. Journal of Applied Probability, 16, 1, 154-162.
Holst, L. (1981). Some asymptotic results for incomplete
multinomial or Poisson samples. Scandinavian Journal of Statistics,
8, 243-246.
Johnson, N.L. and Kotz, S. (1977). Urn models and their
applications: An approach to modern discrete probability theory.
New York: John Wiley.
Lewontin, R.C. and Prout, T. (1956). Estimation of the number of
classes in a population. Biometrics, 12, 211-223.
Marchand, J.P. and Schroeck, P.E. (1982). On estimation of the
number of equally likely classes in a population. Communications
in Statistics, Part A - Theory and Methods, 11, 1139-1146.
McNeil, D. (1973). Estimating an author's vocabulary. Journal of
the American Statistical Association, 68, 341, 92-97.