Rev. Acad. Canar. Cienc., XII (Núms. 1-2), 147-159 (2000)
ROC CURVES DETERMINATION BY NONPARAMETRIC METHODS
Saavedra, P
Departamento de Matemáticas, Universidad de Las Palmas de Gran Canaria,
35017 Las Palmas de Gran Canaria,
email: saavedra@dma.ulpgc.es
Keywords and Phrases: Nonparametric estimation, optimal bandwidth, diabetes
diagnosis.
ABSTRACT
A method for estimating ROC curves is proposed, based on the nonparametric
estimation of the distribution function. The optimal bandwidth, defined through
the mean integrated squared error, is estimated by means of a cross-validation
function. A simulation study is carried out and the methodology is applied to
data from patients with diabetes.
1. INTRODUCTION
Diseases, as a general rule, alter the standard values of several numeric
variables. Thus, a depleted CD4 lymphocyte count may indicate an HIV
infection, and high basal glucose levels may suggest diabetes. When the
definitive diagnosis of a given pathology entails some risk to the patient or is
economically very costly, the variables presumably affected can be the basis of an
alternative diagnosis. Thus, when a pathology causes a decrease in the usual
levels of a certain variable, an alternative diagnostic test can be used,
which consists of classifying a patient as sick or healthy according to
© Del documento, de los autores. Digitalización realizada por ULPGC. Biblioteca Universitaria, 2017
whether the measured value is below a certain cut-off value C or not. The
medical practitioner will then choose the value of C whose sensitivity and
specificity are considered most acceptable. If the basic aim of a diagnostic test
is to rule out the disease in the patient, the test should have a high
sensitivity, even if that implies a high false positive rate too. The ROC curve
gives the sensitivity of the test as a function of the false positive rate. Each
point of the curve is associated with a cut-off value; therefore, choosing a point
on the curve fixes the cut-off value, the false positive rate and the sensitivity
of the test.
The estimation of ROC curves depends basically on the estimation of the
probability distributions of the marker in the case and control populations. These
distributions are frequently estimated assuming that the data are normally
distributed or, since the data often have long right tails, that the
log-transformed data follow normal distributions. However, transformations of this
type are rarely carried out in routine practice. More general transformations,
such as those of Box-Cox, can be used, but the same transformation will seldom
lead to normality in both populations. On the other hand, when data are scarce, a
goodness-of-fit test cannot really establish whether the data are normally
distributed.
An alternative methodology for estimating the ROC curve is based on
estimating, by nonparametric methods, the probability distribution functions of
the marker in the case and control groups. Kernel density estimation, introduced
by Rosenblatt (1956), is widely developed (Härdle, 1991; Cao et al., 1994).
Azzalini (1981) considered the kernel estimate of the distribution function
obtained by integrating a kernel estimate of the density function. More recently,
Bowman et al. (1998) discussed a procedure for estimating the smoothing parameter
or bandwidth. In this paper we consider kernel estimates of the distribution
functions and thereby obtain
an estimate of the ROC curve. Two log-normal distributions, with parameters
chosen to represent the marker in each of the two groups, are simulated. We
compare graphically the theoretical ROC curve with the one obtained by the
proposed methodology and with the one obtained under the hypothesis of normally
distributed data.
2. ROC CURVES
Let us consider a population whose individuals may or may not have a certain
illness, for whose diagnosis a numeric marker $X$ is available. Let $F_1(x)$ and
$F_2(x)$ be the probability distribution functions of that marker in the sick and
healthy populations respectively. Suppose that the disease lowers the normal
values of $X$. The diagnostic criterion based on $X$ then consists of determining
a cut-off value $C$ such that a subject is diagnosed as sick when $X \le C$ and as
healthy otherwise. The sensitivity and the false positive rate of the diagnostic
test are then $F_1(C)$ and $F_2(C)$ respectively, and the ROC curve is defined as
the graph obtained by plotting the sensitivity against the false positive rate. In
those cases where the disease raises the marker, the individual is diagnosed as
ill when the value of $X$ exceeds the cut-off value $C$. In this case, the
sensitivity of the test is $1 - F_1(C)$ and the false positive rate is
$1 - F_2(C)$. The diagnostic power of the marker $X$ can be measured by the area
under the corresponding ROC curve: areas close to one indicate high diagnostic
power, while values close to 0.5 or below indicate poor diagnostic power.
In what follows, we assume that the distributions $F_1$ and $F_2$ are
absolutely continuous and therefore have density functions, denoted by $f_1(x)$
and $f_2(x)$ respectively. This implies that $F_1$ and $F_2$ are continuous and
strictly increasing on the supports of their density functions.
Thus, the inverse function $F_2^{-1}$ is defined on $]0,1[$. We also define
$F_2^{-1}(0) = \sup\{x;\ F_2(x) = 0\}$ and $F_2^{-1}(1) = \inf\{x;\ F_2(x) = 1\}$.
If $\phi$ is the false positive rate and $S$ the sensitivity, it is easy to prove
that the ROC curve is the graph of the function $S(\phi) = F_1 \circ F_2^{-1}(\phi)$,
$\phi \in\ ]0,1[$, which is a continuous function.
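As an illustration of this definition, the following minimal sketch evaluates $S(\phi) = F_1 \circ F_2^{-1}(\phi)$ and approximates the area under the curve by a midpoint rule. The two normal populations and their parameters are hypothetical choices made only for this example, and the function names `roc_sensitivity` and `area_under_roc` are not part of the paper.

```python
from statistics import NormalDist


def roc_sensitivity(F1, F2_inv, phi):
    """Sensitivity S(phi) = F1(F2^{-1}(phi)) when the disease lowers the marker."""
    return F1(F2_inv(phi))


def area_under_roc(F1, F2_inv, steps=2000):
    """Midpoint-rule approximation of the integral of S over ]0, 1[."""
    width = 1.0 / steps
    return width * sum(roc_sensitivity(F1, F2_inv, (k + 0.5) * width)
                       for k in range(steps))


# Hypothetical populations (not from the paper): the sick group has lower values.
sick = NormalDist(mu=0.0, sigma=1.0)
healthy = NormalDist(mu=1.5, sigma=1.0)

# Sensitivity attained when a 10% false positive rate is accepted.
sens_at_10 = roc_sensitivity(sick.cdf, healthy.inv_cdf, 0.10)
auc = area_under_roc(sick.cdf, healthy.inv_cdf)
print(f"S(0.10) = {sens_at_10:.3f}, AUC = {auc:.3f}")
```

When both populations coincide, the same functions return $S(\phi) = \phi$ and an area of 0.5, which matches the remark that values close to 0.5 indicate poor diagnostic power.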
3. KERNEL ESTIMATES OF THE DISTRIBUTION FUNCTION
Let $F(x)$ be the probability distribution function of a random variable $X$,
and let $X_1, \ldots, X_n$ be a random sample from $F(x)$. The kernel estimate of
the distribution function $F(x)$ is defined as:
\[ \hat F_n(x;h) = \frac{1}{n} \sum_{i=1}^{n} W\!\left(\frac{x - X_i}{h}\right) \tag{1} \]
where $W$ is a distribution function and $h$ the smoothing parameter or bandwidth.
We call the function $W$ the integrated kernel since its derivative, if it exists,
is a kernel function in the ordinary sense. In this paper we consider integrated
kernels $W$ such that the ordinary kernel $K(x) = W'(x)$ is Lipschitz continuous,
has compact support and has a finite second-order moment.
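A minimal sketch of the estimate (1), assuming the Epanechnikov kernel that the paper adopts in Section 5. Its integrated kernel is $W(u) = (2 + 3u - u^3)/4$ on $[-1,1]$, with $W = 0$ below and $W = 1$ above that interval. The function names and the sample values are illustrative assumptions.

```python
def integrated_epanechnikov(u):
    """W(u): integral of K(t) = 3(1 - t^2)/4 from -1 up to u."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return (2.0 + 3.0 * u - u ** 3) / 4.0


def kernel_cdf(x, sample, h):
    """Kernel estimate (1): average of W((x - X_i) / h) over the sample."""
    return sum(integrated_epanechnikov((x - xi) / h) for xi in sample) / len(sample)


# Made-up observations, used only to exercise the estimator.
sample = [1.2, 2.5, 2.9, 3.4, 4.8]
for x in (0.0, 2.9, 6.0):
    print(x, round(kernel_cdf(x, sample, h=1.0), 4))
```

Unlike the empirical distribution function, this estimate is continuous in $x$; the bandwidth $h$ controls how far each observation spreads its contribution.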
In order to define an optimal bandwidth for the kernel estimate (1) of the
distribution function, expressions for the bias and the variance of that estimate
must first be obtained.
Theorem 1. Suppose that $W(x)$ is differentiable with $W'(x) = K(x)$, where
$K(x)$ satisfies the conditions stated above, and that $F \in C^2$. Then, as
$h \to 0$:
\[ \text{i)}\quad E[\hat F_n(x;h)] = F(x) + \frac{F''(x)\,\mu_2(K)}{2}\,h^2 + o(h^2) \]
\[ \text{ii)}\quad \mathrm{var}(\hat F_n(x;h)) = \frac{F(x)\{1 - F(x)\}}{n} - \frac{F'(x)\,\mu_1((W^2)')}{n}\,h + o(h) \tag{2} \]
where $\mu_2(K) = \int z^2 K(z)\,dz$ and $\mu_1((W^2)') = \int y\,(W^2)'(y)\,dy$.
The proof is deferred to the appendix.
Comparing the properties of the empirical distribution function with those of
the estimate $\hat F_n(x;h)$, we observe that the latter has a bias of order
$h^2$, though its variance is lower.
We then take as optimal bandwidth $h_0$ the value that minimizes the
mean integrated squared error, given by:
\[ MISE(h) = E\left\{ \int \{\hat F_n(x;h) - F(x)\}^2\,dx \right\} \tag{3} \]
which is equal to
\[ MISE(h) = \int \mathrm{var}(\hat F_n(x;h))\,dx + \int \{E[\hat F_n(x;h)] - F(x)\}^2\,dx \]
According to Theorem 1, the asymptotically optimal bandwidth is given by:
\[ h_0 = \left\{ \frac{\mu_1((W^2)')}{n\,\mu_2(K)^2\,\|F''\|_2^2} \right\}^{1/3} \tag{4} \]
where $\|F''\|_2^2 = \int F''(x)^2\,dx$.
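Formula (4) follows from substituting the expansions of Theorem 1 into the MISE and setting the derivative with respect to $h$ to zero. As a rough illustration, the sketch below evaluates the two kernel constants numerically for the Epanechnikov kernel and plugs in $\|F''\|_2^2 = 1/(4\sqrt{\pi}\,\sigma^3)$, the value for a normal distribution with standard deviation $\sigma$. This normal-reference choice is an assumption introduced here for illustration only; the paper estimates the bandwidth by cross-validation instead.

```python
import math


def K(t):
    """Epanechnikov kernel."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def W(t):
    """Integrated Epanechnikov kernel."""
    if t <= -1.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return (2.0 + 3.0 * t - t ** 3) / 4.0


def integrate(g, a=-1.0, b=1.0, steps=20000):
    """Midpoint rule on [a, b]."""
    dx = (b - a) / steps
    return dx * sum(g(a + (k + 0.5) * dx) for k in range(steps))


mu2_K = integrate(lambda t: t * t * K(t))            # second moment of K
mu1_W2 = integrate(lambda t: t * 2.0 * W(t) * K(t))  # first moment of (W^2)' = 2 W K


def normal_reference_bandwidth(n, sigma):
    """Formula (4) with ||F''||^2 replaced by its value under a normal law (assumption)."""
    norm_F2 = 1.0 / (4.0 * math.sqrt(math.pi) * sigma ** 3)
    return (mu1_W2 / (n * mu2_K ** 2 * norm_F2)) ** (1.0 / 3.0)


print(mu2_K, mu1_W2, normal_reference_bandwidth(n=60, sigma=1.0))
```

For this kernel the constants are $\mu_2(K) = 1/5$ and $\mu_1((W^2)') = 9/35$, and the resulting bandwidth decreases at the rate $n^{-1/3}$.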
Since $h_0$ is unknown, we estimate it by a cross-validation method due to
Bowman et al. (1998). Let the cross-validation function be given by:
\[ cv(h) = \frac{1}{n} \sum_{i=1}^{n} \int \left\{ 1_{(0,\infty)}(x - X_i) - \hat F_{-i}(x;h) \right\}^2 dx \tag{5} \]
where $\hat F_{-i}(x;h)$ denotes the kernel estimate evaluated at $x$, but
constructed from the data with the observation $X_i$ omitted. The optimal smoothing
parameter $h_0$ is then estimated by $h_n$ such that $cv(h_n) = \min_h cv(h)$.
A property of this approach follows by considering:
\[ H(h) = cv(h) - \frac{1}{n} \sum_{i=1}^{n} \int \left\{ 1_{(0,\infty)}(x - X_i) - F(x) \right\}^2 dx \tag{6} \]
The subtracted term does not involve $h$, so the cross-validation procedure is
unaffected. It is straightforward to prove that:
\[ E[H(h)] = E\left[ \int \{\hat F_{n-1}(x;h) - F(x)\}^2\,dx \right] \tag{7} \]
This equation suggests that $H(h)$ might be a good approximation to $MISE(h)$.
Bowman et al. (1998) show that, under certain conditions, $h_n/h_0 \to 1$ with
probability 1 as $n \to \infty$, where $h_0$ is the optimal bandwidth and $h_n$
the one that minimizes $cv(h)$.
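The cross-validation function (5) can be evaluated numerically. The sketch below is one possible implementation under stated assumptions: it uses the Epanechnikov integrated kernel, a midpoint rule for the integral and a simple grid search over candidate bandwidths. The simulated normal sample, the grid of candidates and all function names are illustrative choices, not taken from the paper.

```python
import random


def W(u):
    """Integrated Epanechnikov kernel."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return (2.0 + 3.0 * u - u ** 3) / 4.0


def cv_score(sample, h, grid_size=400):
    """Cross-validation function (5), integrated by a midpoint rule."""
    n = len(sample)
    # Outside [min - h, max + h] the indicator and the estimate coincide.
    lo, hi = min(sample) - h, max(sample) + h
    dx = (hi - lo) / grid_size
    total = 0.0
    for k in range(grid_size):
        x = lo + (k + 0.5) * dx
        w = [W((x - xi) / h) for xi in sample]
        s = sum(w)
        for xi, wi in zip(sample, w):
            leave_one_out = (s - wi) / (n - 1)   # estimate built without X_i
            indicator = 1.0 if x > xi else 0.0
            total += (indicator - leave_one_out) ** 2
    return total * dx / n


def cv_bandwidth(sample, candidates):
    """Candidate bandwidth with the smallest cross-validation score."""
    return min(candidates, key=lambda h: cv_score(sample, h))


random.seed(0)                                         # arbitrary seed
sample = [random.gauss(0.0, 1.0) for _ in range(40)]   # simulated, not real data
candidates = [0.1 * k for k in range(1, 21)]           # 0.1, 0.2, ..., 2.0
h_cv = cv_bandwidth(sample, candidates)
print(h_cv)
```

The leave-one-out estimates are obtained from the full sum by subtracting one term, which avoids rebuilding the estimate $n$ times at every grid point.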
4. ROC CURVE ESTIMATION
Let $X_{1,1}, X_{1,2}, \ldots, X_{1,n_1}$ and $X_{2,1}, X_{2,2}, \ldots, X_{2,n_2}$
be simple random samples from the distributions $F_1(x)$ and $F_2(x)$ respectively,
introduced in Section 2. For each distribution we consider the estimate given in (1):
\[ \hat F_{i,n_i}(x;h_i) = \frac{1}{n_i} \sum_{j=1}^{n_i} W\!\left(\frac{x - X_{i,j}}{h_i}\right), \quad i = 1,2 \tag{8} \]
where $W(x) = \int_{-\infty}^{x} K(t)\,dt$ and $K(t)$ is a kernel function satisfying
the properties given in Section 3. Given the properties of the integrated kernel
$W(x)$, the function $\hat F_{2,n_2}(x;h_2)$ is strictly increasing on
$\{x;\ 0 < \hat F_{2,n_2}(x;h_2) < 1\}$. We also define
$\hat F_{2,n_2}^{-1}(0;h_2) = \sup\{x;\ \hat F_{2,n_2}(x;h_2) = 0\}$ and
$\hat F_{2,n_2}^{-1}(1;h_2) = \inf\{x;\ \hat F_{2,n_2}(x;h_2) = 1\}$ as in Section 2.
In this way, we estimate the ROC curve as:
\[ \hat S(\phi;h_1,h_2) = \hat F_{1,n_1} \circ \hat F_{2,n_2}^{-1}(\phi), \quad \phi \in [0,1] \tag{9} \]
For each cut-off value $C$ considered, the estimated sensitivity of the
diagnostic test is $\hat F_{1,n_1}(C;h_1)$ and the estimated false positive rate
is $\hat F_{2,n_2}(C;h_2)$.
The consistency of the kernel estimates of the distribution functions implies
the consistency of the ROC curve estimate given by (9), in the sense that, for
each cut-off $C$, $\hat F_{1,n_1}(C;h_1) \to F_1(C)$ and
$\hat F_{2,n_2}(C;h_2) \to F_2(C)$ when $\min\{n_1,n_2\} \to \infty$.
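The estimate (9) can be sketched as follows, assuming the Epanechnikov integrated kernel and inverting the estimated distribution function of the healthy group by bisection. The bisection step, the made-up marker values and the function names are assumptions introduced for illustration; the paper does not specify how the inverse is computed.

```python
def W(u):
    """Integrated Epanechnikov kernel."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return (2.0 + 3.0 * u - u ** 3) / 4.0


def kernel_cdf(x, sample, h):
    """Kernel estimate (8) of a distribution function."""
    return sum(W((x - xi) / h) for xi in sample) / len(sample)


def kernel_quantile(phi, sample, h, iterations=60):
    """Inverse of the kernel estimate, found by bisection."""
    lo, hi = min(sample) - h, max(sample) + h
    if phi <= 0.0:
        return lo                       # sup{x : estimate = 0}
    if phi >= 1.0:
        return hi                       # inf{x : estimate = 1}
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, sample, h) < phi:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def estimated_roc(phi, sick, healthy, h1, h2):
    """Estimate (9): sensitivity at false positive rate phi."""
    return kernel_cdf(kernel_quantile(phi, healthy, h2), sick, h1)


# Made-up marker values; the disease is assumed to lower the marker.
sick = [1.0, 1.5, 2.2, 2.8, 3.1]
healthy = [2.5, 3.0, 3.6, 4.2, 5.0]
curve = [(k / 10, estimated_roc(k / 10, sick, healthy, 1.0, 1.0)) for k in range(11)]
print(curve)
```

Passing the same sample and bandwidth for both groups returns $\hat S(\phi) = \phi$ up to the bisection tolerance, which is a convenient sanity check.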
It is well known that estimates such as (1) are biased, with the asymptotic
bias given in (2). The estimated ROC curve can therefore accumulate the biases
of both estimated distribution functions. These biases can be approximated by
estimating $f'(x)$ from the data as:
\[ \hat f'(x) = \frac{1}{n\,l^2} \sum_{i=1}^{n} K'\!\left(\frac{x - X_i}{l}\right) \tag{10} \]
where $K(x)$ is a kernel function and $l$ the corresponding bandwidth. According to
(2), the estimated bias is $\hat f'(x) \cdot \mu_2(K) \cdot h^2/2$. Thus, the
bias-corrected estimate of the distribution function is:
\[ \tilde F(x;h) = \hat F(x;h) - \frac{\hat f'(x) \cdot \mu_2(K) \cdot h^2}{2} \tag{11} \]
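A minimal sketch of the correction (11), assuming the Epanechnikov kernel, for which $K'(t) = -3t/2$ inside the support and $\mu_2(K) = 1/5$. How the pilot bandwidth $l$ should be chosen is not stated in the text, so it is left as a free parameter here; the function names are illustrative.

```python
MU2_EPANECHNIKOV = 0.2   # second moment of K(t) = 3(1 - t^2)/4 on [-1, 1]


def W(u):
    """Integrated Epanechnikov kernel."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return (2.0 + 3.0 * u - u ** 3) / 4.0


def K_prime(u):
    """Derivative of the Epanechnikov kernel inside its support."""
    return -1.5 * u if abs(u) < 1.0 else 0.0


def kernel_cdf(x, sample, h):
    """Kernel estimate (1) of the distribution function."""
    return sum(W((x - xi) / h) for xi in sample) / len(sample)


def density_derivative(x, sample, l):
    """Estimate (10) of f'(x) with pilot bandwidth l."""
    return sum(K_prime((x - xi) / l) for xi in sample) / (len(sample) * l * l)


def corrected_cdf(x, sample, h, l):
    """Bias-corrected estimate (11)."""
    bias = density_derivative(x, sample, l) * MU2_EPANECHNIKOV * h * h / 2.0
    return kernel_cdf(x, sample, h) - bias


# Made-up observations, used only to exercise the functions.
sample = [1.2, 2.5, 2.9, 3.4, 4.8]
print(kernel_cdf(3.0, sample, 1.0), corrected_cdf(3.0, sample, 1.0, 1.0))
```

Note that the corrected value is not guaranteed to stay within $[0,1]$ or to be monotone in $x$, so an implementation may need to handle the tails separately.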
5. APPLICATIONS
A simulation study is carried out in this section: the theoretical ROC curve is
compared with the one obtained by the methodology based on the nonparametric
estimation of the distribution functions, and with the one obtained assuming
normality of the variable within each group. The ROC curve obtained by the
proposed nonparametric method for the diagnosis of diabetes from basal glucose
is also given. For each group, we have used the Epanechnikov kernel given by
$K(t) = \frac{3}{4}(1 - t^2) \cdot 1_{[-1,1]}(t)$.
5.1. Simulation study. Suppose that a marker $X$ follows a log-normal
distribution such that $\log(X)$ is $N(2,1)$ in the sick group and $N(3,1/2)$
in the healthy group. A random sample of size 60 has been simulated for each
group. We have estimated the ROC curve using the proposed method based on the
nonparametric estimation of the distribution functions. Figure 1 shows together
the theoretical ROC curve, the one obtained by the method based on the
nonparametric estimation of the distribution functions and the one obtained
under the normality assumption. Figure 2 shows the theoretical ROC curve
together with the nonparametric estimates with and without bias correction.
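The simulation setting can be reproduced approximately as follows. The sketch assumes that the second parameter of $N(3,1/2)$ is a variance (the text does not say whether it is a variance or a standard deviation) and uses an arbitrary random seed, so its samples are not those of the paper. The empirical area computed at the end is the plain proportion of correctly ordered pairs, included only as a quick check and not the kernel method of Section 4.

```python
import math
import random
from statistics import NormalDist

MU_SICK, SD_SICK = 2.0, 1.0
MU_HEALTHY, SD_HEALTHY = 3.0, math.sqrt(0.5)   # assumption: 1/2 is a variance

std_normal = NormalDist()


def theoretical_roc(phi):
    """S(phi) = F1(F2^{-1}(phi)) for the two log-normal populations."""
    log_cutoff = MU_HEALTHY + SD_HEALTHY * std_normal.inv_cdf(phi)
    return std_normal.cdf((log_cutoff - MU_SICK) / SD_SICK)


# Closed-form area: P(X_sick < X_healthy) for normal variables on the log scale.
theoretical_auc = std_normal.cdf(
    (MU_HEALTHY - MU_SICK) / math.sqrt(SD_SICK ** 2 + SD_HEALTHY ** 2))

random.seed(2000)   # arbitrary seed
sick = [random.lognormvariate(MU_SICK, SD_SICK) for _ in range(60)]
healthy = [random.lognormvariate(MU_HEALTHY, SD_HEALTHY) for _ in range(60)]

# Empirical area: proportion of (sick, healthy) pairs ordered as expected.
empirical_auc = sum(s < h for s in sick for h in healthy) / (len(sick) * len(healthy))
print(round(theoretical_auc, 4), round(empirical_auc, 4))
```

Because the logarithm is monotone, the theoretical ROC curve of the log-normal markers coincides with that of the underlying normal variables, which is why only normal quantities appear in `theoretical_roc`.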
5.2. Diabetes diagnosis. The diagnosis of diabetes requires several tests,
e.g. the preparation of a glucose metabolic curve. However, the determination of
basal glucose can be used as a test to rule out the disease. We have therefore
constructed a ROC curve (Figure 3) based on the basal glucose measurements of 67
patients with a confirmed diagnosis of type 2 diabetes and 73 controls, carried
out at the Hospital Insular of Gran Canaria. The area under the ROC curve is 0.7636.
Since the main purpose of the diagnostic test is to rule out the disease, the test
must have a high sensitivity, even though in this case the false positive rate
will then be high too. To obtain 80% sensitivity, a very high false positive rate
(41%) is required. This corresponds to a cut-off value C = 97.64 mg/dl, meaning
that for a basal glucose below this value the illness can reasonably be ruled
out, whereas higher values would require complementary tests. Figure 3 provides
the different sensitivities as a function of the chosen cut-off value C. Figure 4
shows the ROC curve corresponding to the Glucose Tolerance Test (GTT). Fifty-one
diabetic patients and the same number of controls received the glucose load, and
its concentration in blood was then measured. The area under this ROC curve is
0.9549, which implies a higher diagnostic capacity. Both ROC curves have been
estimated using the bias-corrected estimate $\tilde F(x;h)$ given in (11).
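The choice of cut-off described here can be sketched as follows for a marker that the disease raises, so that the sensitivity is $1 - F_1(C)$ and the false positive rate is $1 - F_2(C)$. The two normal distributions are hypothetical stand-ins (not the hospital data), and any estimated distribution function could be passed in their place; the function name `cutoff_for_sensitivity` is an assumption.

```python
from statistics import NormalDist


def cutoff_for_sensitivity(F1, target, lo, hi, iterations=80):
    """Find C with 1 - F1(C) = target by bisection; F1 is the cdf in the sick group."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if 1.0 - F1(mid) > target:   # sensitivity above target: the cut-off can be raised
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical marker distributions (illustrative values only).
sick = NormalDist(mu=2.0, sigma=1.0)
healthy = NormalDist(mu=0.0, sigma=1.0)

C = cutoff_for_sensitivity(sick.cdf, target=0.80, lo=-10.0, hi=10.0)
sensitivity = 1.0 - sick.cdf(C)
false_positive_rate = 1.0 - healthy.cdf(C)
print(round(C, 3), round(sensitivity, 3), round(false_positive_rate, 3))
```

Subjects with a marker below `C` would be classified as healthy, and the false positive rate at that cut-off is the price paid for the required sensitivity.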
[Figure 1 plot. Horizontal axis: false positive; vertical axis: sensitivity. Curves shown: Normal, Nonparametric, True.]
Figure 1. Simulation study. Estimated ROC curve by the nonparametric method and
under the hypothesis of normality.
[Figure 2 plot. Horizontal axis: false positive; vertical axis: sensitivity. Curves shown: without correction, correction by bias, True.]
Figure 2. Simulation study. Theoretical ROC curve together with the nonparametric
estimates with and without bias correction.
[Figure 3 plot. Horizontal axis: false positive; vertical axis: sensitivity.]
Figure 3. ROC curve of basal glucose.
[Figure 4 plot. Horizontal axis: false positive; vertical axis: sensitivity.]
Figure 4. ROC curve of GTT.
BIBLIOGRAPHY
Azzalini, A. (1981), "A note on the estimation of a distribution function and
quantiles by a kernel method". Biometrika, 68, 326-328.
Bowman, A., Hall, P. and Prvan, T. (1998), "Bandwidth selection for the
smoothing of distribution functions". Biometrika, 85, 799-808.
Cao, R., Cuevas, A. and González Manteiga, W. (1994), "A comparative
study of several smoothing methods in density estimation". Computational
Statistics and Data Analysis, 17, 153-176.
Härdle, W. (1991), Smoothing Techniques. Springer-Verlag.
Rosenblatt, M. (1956), "Remarks on some nonparametric estimates of a
density function". Annals of Mathematical Statistics, 27, 832-837.
APPENDIX
Proof of theorem 1.
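Let $\hat F_n(x;h) = \frac{1}{n} \sum_{i=1}^{n} W\!\left(\frac{x - X_i}{h}\right)$. Then,
i)
\[ E[\hat F_n(x;h)] = E\left[W\!\left(\frac{x - X}{h}\right)\right] = \int_{-\infty}^{\infty} W\!\left(\frac{x - u}{h}\right) f(u)\,du = h \int_{-\infty}^{\infty} W(y)\,f(x - hy)\,dy \]
\[ = h \int_{-\infty}^{\infty} \int_{-\infty}^{y} K(z)\,f(x - hy)\,dz\,dy = h \int_{-\infty}^{\infty} K(z) \int_{z}^{\infty} f(x - hy)\,dy\,dz = \int_{-\infty}^{\infty} K(z)\,F(x - hz)\,dz \]
\[ = F(x) + \frac{F''(x)\,\mu_2(K)}{2}\,h^2 + o(h^2), \quad \text{for } h \to 0. \]
ii) Proceeding as in i) with $W^2$ in place of $W$,
\[ E\left[W^2\!\left(\frac{x - X}{h}\right)\right] = \int_{-\infty}^{\infty} F(x - hy)\,\frac{\partial}{\partial y}(W^2)(y)\,dy = F(x) - h \cdot f(x) \int_{-\infty}^{\infty} y \cdot \frac{d}{dy}(W^2)(y)\,dy + o(h), \quad \text{for } h \to 0. \]
In this way,
\[ \mathrm{var}\left(W\!\left(\frac{x - X}{h}\right)\right) = F(x) - F^2(x) - h \cdot f(x) \int_{-\infty}^{\infty} y \cdot \frac{d}{dy}(W^2)(y)\,dy + o(h). \]
Finally, bearing in mind that $\mathrm{var}(\hat F_n(x;h)) = \frac{1}{n}\,\mathrm{var}\left(W\!\left(\frac{x - X}{h}\right)\right)$, the result follows.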