# p. 1

**Generalization of the Principal Components Analysis to Histogram Data**

O. Rodríguez¹, E. Diday², and S. Winsberg³

¹ University Paris 9 Dauphine, Ceremade, Pl. du Ml. de L. de Tassigny, 75016 Paris, France, orodrigu@ceremade.dauphine.fr

² University Paris 9 Dauphine, Ceremade, Pl. du Ml. de L. de Tassigny, 75016 Paris, France, diday@ceremade.dauphine.fr

³ IRCAM, 1 Place Igor Stravinsky, F75004 Paris, France, suzanne.winsberg@ircam.fr

**Abstract.** In this article we propose an algorithm for Principal Components Analysis when the variables are of histogram type. The algorithm also works if the data table mixes variables of interval type and histogram type. If all the variables are of interval type, it produces the same output as the algorithm of the Centers Method proposed in [2, Cazes, Chouakria, Diday and Schektman (1997)].

**1 The Algorithm**

In this algorithm we use the idea proposed in [5, Diday (1998)]. We represent each histogram-individual by a succession of $k$ interval-individuals (the first one included in the second one, the second one included in the third one, and so on), where $k$ is the maximum number of modalities taken by some variable in the input symbolic data table.

Instead of representing the histograms in the factorial plane, we represent the Empirical Distribution Function $F_Y$ defined in [1, Bock and Diday (2000)] associated with each histogram. In other words, if we have a histogram variable $Y$ on a set $E = \{a_1, a_2, \ldots\}$ of objects with domain $\mathcal{Y}$, represented by the mapping $Y(a) = (U(a), \pi_a)$ for $a \in E$, where $\pi_a$ is a frequency distribution, then in the algorithm we use the function

$$F(x) = \sum_{i \le x} \pi_i$$

instead of the histogram.

**Definition 1.** Let $X = (x_{ij})_{i=1,2,\ldots,m;\ j=1,2,\ldots,n}$ be a symbolic data table with variables of continuous, interval and histogram type, and let $k = \max\{s_j\}$, where $s_j$ is the number of modalities of $Y^j$, $j = 1, 2, \ldots, n$, and $Y^j$ is of histogram type⁴. We define the vector-succession of intervals associated with each cell of $X$ as:

⁴ If all the variables are of interval type then $k = 1$.


# p. 2

1. If $x_{ij} = [a, b]$, then the associated vector-succession of intervals is

$$x^{*}_{ij} = \begin{pmatrix} [a, b] \\ [a, b] \\ \vdots \\ [a, b] \end{pmatrix}_{k \times 1}$$

2. If $x_{ij} = (1(p_1), 2(p_2), \ldots, s(p_s))$ with $s \le k$ (histogram), then the associated vector-succession of intervals is

$$x^{*}_{ij} = \begin{pmatrix} [0, p_1] \\ [0, p_1 + p_2] \\ \vdots \\ \left[0, \sum_{w=1}^{s} p_w\right] \end{pmatrix}_{k \times 1}$$

3. If $x_{ij} = a$, then the associated vector-succession of intervals is

$$x^{*}_{ij} = \begin{pmatrix} [a, a] \\ [a, a] \\ \vdots \\ [a, a] \end{pmatrix}_{k \times 1}$$

**Definition 2.** Let $X = (x_{ij})_{i=1,2,\ldots,m;\ j=1,2,\ldots,n}$ be a symbolic data table with variables of continuous, interval and histogram type. We define the matrix $X^{*} = (x^{*}_{ij})$ for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$. It is important to note that $X^{*}$ has $m \cdot k$ rows⁵ and $n$ columns.

**Example 1.** If

$$X = \begin{pmatrix} 1(0.1),\ 2(0.4),\ 3(0.5) & 1(0.2),\ 2(0.3),\ 3(0.5) \\ 1(0.7),\ 2(0.2),\ 3(0.1) & 1(0.8),\ 2(0.1),\ 3(0.1) \end{pmatrix}$$

then

$$X^{*} = \begin{pmatrix} [0.0000, 0.1000] & [0.0000, 0.2000] \\ [0.0000, 0.5000] & [0.0000, 0.5000] \\ [0.0000, 1.0000] & [0.0000, 1.0000] \\ [0.0000, 0.7000] & [0.0000, 0.8000] \\ [0.0000, 0.9000] & [0.0000, 0.9000] \\ [0.0000, 1.0000] & [0.0000, 1.0000] \end{pmatrix}$$

The idea is to apply Algorithm 3 proposed in [6, Rodríguez (2000)] to the matrix $X^{*}$. With this principal components analysis we can find the shape of each individual-histogram in the principal plane. However, all the individual-histograms will be projected in almost the same position (around the origin), so we have to apply another principal component analysis in order to find a good cluster structure for the individual-histograms. Therefore we apply a classical principal component analysis to the matrix presented in the following definitions.

⁵ $k$ as in the previous definition.
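The construction in Definitions 1 and 2 can be sketched in Python with NumPy. This is only an illustrative sketch, not the authors' implementation. The cell encoding (a number for a continuous value, a tuple `(a, b)` for an interval, a list of probabilities for a histogram) and the function names are hypothetical, and the padding used when a histogram has fewer than $k$ modalities is an assumption, since the definition does not spell it out.

```python
import numpy as np

def succession_of_intervals(cell, k):
    """Return a (k, 2) array of [lower, upper] bounds for one cell (Definition 1)."""
    if isinstance(cell, list):
        # Histogram: upper bounds are the cumulative sums p1, p1+p2, ...
        upper = np.cumsum(cell)
        # Assumption: with s < k modalities, repeat the last cumulative value.
        padded = np.concatenate([upper, np.full(k - len(cell), upper[-1])])
        return np.column_stack([np.zeros(k), padded])
    if isinstance(cell, tuple):
        a, b = cell          # interval [a, b]
    else:
        a = b = cell         # continuous value a, treated as [a, a]
    return np.tile([float(a), float(b)], (k, 1))

def build_interval_matrix(table):
    """Stack the k intervals of every cell: result has shape (m*k, n, 2) (Definition 2)."""
    # k is the largest number of modalities over histogram cells, 1 if there are none.
    k = max([len(c) for row in table for c in row if isinstance(c, list)], default=1)
    blocks = []
    for row in table:
        cells = [succession_of_intervals(c, k) for c in row]  # n arrays of shape (k, 2)
        blocks.append(np.stack(cells, axis=1))                # shape (k, n, 2)
    return np.concatenate(blocks, axis=0)

# Data table of Example 1: two individuals, two histogram variables.
table = [
    [[0.1, 0.4, 0.5], [0.2, 0.3, 0.5]],
    [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1]],
]
X_star = build_interval_matrix(table)
print(X_star[..., 1])  # upper bounds; all lower bounds are 0 for histogram cells
```

Storing the bounds in a trailing axis of length 2 keeps the $m \cdot k$ rows and $n$ columns of Definition 2 visible in the array shape.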


# p. 3

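**Definition 3.** Let $X = (x_{ij})_{i=1,2,\ldots,m;\ j=1,2,\ldots,n}$ be a symbolic data table with variables of continuous, interval and histogram type. We define the row-vector associated with each cell of $X$ as:

1. If $x_{ij} = [a, b]$, then the associated row-vector is $\tilde{x}_{ij} = \left[\frac{a+b}{2}\right]_{1 \times 1}$.
2. If $x_{ij} = (1(p_1), 2(p_2), \ldots, s(p_s))$, where $s$ is the number of modalities of the $j$-th variable, then the associated row-vector is $\tilde{x}_{ij} = [p_1\ p_2\ \cdots\ p_s]_{1 \times s}$.
3. If $x_{ij} = a$, then the associated row-vector is $\tilde{x}_{ij} = [a]_{1 \times 1}$.

**Definition 4.** Let $X = (x_{ij})_{i=1,2,\ldots,m;\ j=1,2,\ldots,n}$ be a symbolic data table with variables of continuous, interval and histogram type. We define the matrix $\tilde{X} = (\tilde{x}_{ij})$ of $m$ rows and $\sum_{j=1}^{n} s_j$ columns, where

$$s_j = \begin{cases} \text{number of modalities of the variable } j & \text{if the variable } j \text{ is of histogram type} \\ 1 & \text{if the variable } j \text{ is of interval type} \\ 1 & \text{if the variable } j \text{ is of continuous type} \end{cases}$$

**Example 2.** If

$$X = \begin{pmatrix} 1(0.1),\ 2(0.4),\ 3(0.5) & 1(0.2),\ 2(0.3),\ 3(0.5) \\ 1(0.7),\ 2(0.2),\ 3(0.1) & 1(0.8),\ 2(0.1),\ 3(0.1) \end{pmatrix}$$

then

$$\tilde{X} = \begin{pmatrix} 0.1 & 0.4 & 0.5 & 0.2 & 0.3 & 0.5 \\ 0.7 & 0.2 & 0.1 & 0.8 & 0.1 & 0.1 \end{pmatrix}$$

The idea of the algorithm is to apply a principal components analysis to the matrix of Definition 2 to find the shape of each individual-histogram, and then to apply another principal component analysis to the matrix $\tilde{X}$. Using these last principal components, we translate each individual-histogram to find the cluster structure of the individual-histograms in the principal plane.

**Algorithm 1. Histogram Principal Components Analysis**

**Input:**

- $m$ = number of symbolic objects.
- $n$ = number of symbolic variables.
- The symbolic data table

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$$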
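The matrix of Definition 4 and the classical principal component analysis applied to it can be sketched as follows. This is a hedged illustration rather than the authors' code. The cell encoding (number, tuple `(a, b)`, list of probabilities) and the function names are hypothetical, every cell of a histogram variable is assumed to list all of its $s_j$ modalities, and the analysis is assumed to be a column-centered, unscaled PCA computed through a singular value decomposition, which the paper does not specify.

```python
import numpy as np

def row_vector(cell):
    """Return the row-vector of Definition 3 for one cell, as a list of floats."""
    if isinstance(cell, list):
        return [float(p) for p in cell]   # histogram: its probabilities
    if isinstance(cell, tuple):
        a, b = cell
        return [(a + b) / 2.0]            # interval: its midpoint
    return [float(cell)]                  # continuous value

def build_flat_matrix(table):
    """Concatenate the row-vectors of each row: m rows, sum of s_j columns (Definition 4)."""
    return np.array([[v for cell in row for v in row_vector(cell)] for row in table])

# Data table of Example 2: two individuals, two histogram variables.
table = [
    [[0.1, 0.4, 0.5], [0.2, 0.3, 0.5]],
    [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1]],
]
X_flat = build_flat_matrix(table)

# Classical PCA (assumption: columns centered, not rescaled).
centered = X_flat - X_flat.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U * s   # coordinates of each individual on the principal axes
print(X_flat)
print(scores)
```

With only two individuals the centered matrix has rank one, so a single principal component carries all the variance in this toy case.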


# p. 4

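**Output:** The symbolic matrix with the first $q$ principal components

$$Y = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1q} \\ y_{21} & y_{22} & \cdots & y_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ y_{m1} & y_{m2} & \cdots & y_{mq} \end{pmatrix}$$

where ($k$ as in Definition 1)

$$y_{ij} = \begin{pmatrix} [\underline{y}^{1}_{ij},\ \overline{y}^{1}_{ij}] \\ [\underline{y}^{2}_{ij},\ \overline{y}^{2}_{ij}] \\ \vdots \\ [\underline{y}^{k}_{ij},\ \overline{y}^{k}_{ij}] \end{pmatrix}$$

**Step 1:** Compute the matrix $X^{*}$ of Definition 2.

**Step 2:** Apply Algorithm 3 proposed in [6, Rodríguez (2000)], taking $X^{*}$ as input. It will produce the matrix

$$Y^{*} = \begin{pmatrix} y^{*}_{11} & y^{*}_{12} & \cdots & y^{*}_{1q_1} \\ y^{*}_{21} & y^{*}_{22} & \cdots & y^{*}_{2q_1} \\ \vdots & \vdots & \ddots & \vdots \\ y^{*}_{m1} & y^{*}_{m2} & \cdots & y^{*}_{mq_1} \end{pmatrix}$$

where ($k$ as in Definition 1)

$$y^{*}_{ij} = \begin{pmatrix} [\underline{y}^{*1}_{ij},\ \overline{y}^{*1}_{ij}] \\ [\underline{y}^{*2}_{ij},\ \overline{y}^{*2}_{ij}] \\ \vdots \\ [\underline{y}^{*k}_{ij},\ \overline{y}^{*k}_{ij}] \end{pmatrix}$$

for $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, q_1$, with $q_1 \le n$.

**Step 3:** Compute the matrix $\tilde{X}$ of Definition 4.

**Step 4:** Apply a classical principal component analysis to the matrix $\tilde{X}$. It will produce the matrix

$$\tilde{Y} = \begin{pmatrix} \tilde{y}_{11} & \tilde{y}_{12} & \cdots & \tilde{y}_{1q_2} \\ \tilde{y}_{21} & \tilde{y}_{22} & \cdots & \tilde{y}_{2q_2} \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{y}_{m1} & \tilde{y}_{m2} & \cdots & \tilde{y}_{mq_2} \end{pmatrix}$$

where $q_2 \le \sum_{j=1}^{n} s_j$ ($s_j$ as in Definition 4).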


# p. 5

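**Step 5:** $q = \min\{q_1, q_2\}$.

**Step 6:** Compute the first $q$ principal components

$$Y = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1q} \\ y_{21} & y_{22} & \cdots & y_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ y_{m1} & y_{m2} & \cdots & y_{mq} \end{pmatrix}$$

using the translation

$$y_{ij} = \begin{pmatrix} [\underline{y}^{1}_{ij},\ \overline{y}^{1}_{ij}] \\ [\underline{y}^{2}_{ij},\ \overline{y}^{2}_{ij}] \\ \vdots \\ [\underline{y}^{k}_{ij},\ \overline{y}^{k}_{ij}] \end{pmatrix} = \begin{pmatrix} [\underline{y}^{*1}_{ij} + \tilde{y}_{ij},\ \overline{y}^{*1}_{ij} + \tilde{y}_{ij}] \\ [\underline{y}^{*2}_{ij} + \tilde{y}_{ij},\ \overline{y}^{*2}_{ij} + \tilde{y}_{ij}] \\ \vdots \\ [\underline{y}^{*k}_{ij} + \tilde{y}_{ij},\ \overline{y}^{*k}_{ij} + \tilde{y}_{ij}] \end{pmatrix}$$

where $\underline{y}^{*w}_{ij}$ and $\overline{y}^{*w}_{ij}$ are the interval bounds produced in Step 2 and $\tilde{y}_{ij}$ is the classical principal component produced in Step 4.

**Step 7:** End of the algorithm.

**2 Examples**

To illustrate how the algorithm works, in this section we present two examples.

**Example 3.** In this example we present the execution of Algorithm 1 with the symbolic data table presented in (1). This matrix has five variables: the first one is of interval type, the second one is a discrete quantitative variable, and the last three variables are of histogram type (the values are truncated).

$$X = \begin{pmatrix}
[1, 4] & 2 & 1(0.4),\ 2(0.1),\ 3(0.2),\ 4(0.07),\ 5(0.02) & 1(0.1),\ 2(0.9) & 1(0.7),\ 2(0.2) \\
[1, 4] & 3 & 1(0.6),\ 2(0.1),\ 3(0.1),\ 5(0.0) & 1(0.1),\ 2(0.9) & 1(0.7),\ 2(0.2) \\
[1, 5] & 2 & 1(0.7),\ 2(0.2) & 1(0.0),\ 2(0.9) & 1(0.7),\ 2(0.2) \\
[1, 4] & 1 & 1(0.7),\ 2(0.0),\ 3(0.1),\ 4(0.0),\ 5(0.0),\ 6(0.0),\ 7(0.0) & 1(0.0),\ 2(0.9) & 1(0.7),\ 2(0.2) \\
[1, 4] & 1 & 1(0.4),\ 3(0.4),\ 4(0.0),\ 5(0.0) & 1(0.0),\ 2(0.9) & 1(0.8),\ 2(0.1) \\
[1, 6] & 2 & 2(0.4),\ 3(0.1),\ 4(0.3),\ 5(0.0),\ 6(0.0),\ 7(0.0) & 1(0.0),\ 2(0.9) & 1(0.7),\ 2(0.2)
\end{pmatrix} \tag{1}$$

Applying Algorithm 1 proposed above, we get the principal plane of Figure 1. If we plot the pyramid (see Figure 2) associated with the matrix (1), we get the same cluster structure as the one obtained in the principal plane of Figure 1: the individuals East Midlands non-metropolitan and Northern Ireland are isolated, and the individuals North non-metropolitan, Yorks and Humberside metropoli, Yorks and Humberside non-metro and East Midlands non-metropolitan are grouped.

**3 The Interpretation**

To explain how to interpret the Histogram Principal Components Analysis we will use one small example. The interpretation of the position of the histogram-individual in the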


# p. 6

Fig. 1. Principal plane with data of continuous, interval and histogram type.

Fig. 2. Pyramid with data of continuous, interval and histogram type.

principal plane is the same as in the classical principal component analysis situation. We shall explain the interpretation of the succession of rectangles that represents each individual.

**Example 4.** Let $X$ be

| | Var-1 | Var-2 |
|---|---|---|
| Ind-1 | 1(0.1), 2(0.4), 3(0.5) | 1(0.2), 2(0.3), 3(0.5) |
| Ind-2 | 1(0.7), 2(0.2), 3(0.1) | 1(0.8), 2(0.1), 3(0.1) |

This matrix can also be represented as shown in Figure 3. If we apply the Histogram Principal Components Analysis to the previous data table, we get the principal plane shown in Figure 4. The smallest rectangle of the projection of individual 1 (Ind-1) represents the probability that individual 1 takes the modality 1 for the variable 1 or the modality 1 for the


# p. 7

Fig. 3. Data table with two individuals and two histogram variables.

variable 2. The size of the rectangle agrees with the representation of individual 1 in Figure 3, because the value of the modality 1 for the variable 1 is 0.1 and the value of the modality 1 for the variable 2 is 0.2, i.e. the mean for the modality 1 is 0.15.

The second rectangle of the projection of individual 1 represents the probability that individual 1 takes the modality 1 or the modality 2 for the variable 1, or the probability that individual 1 takes the modality 1 or the modality 2 for the variable 2. The size of the second rectangle also agrees with the representation of individual 1 in Figure 3, because the value of the empirical distribution function for the modality 2 of the variable 1 is 0.5 and the value of the empirical distribution function for the modality 2 of the variable 2 is also 0.5. The third rectangle of individual 1 represents the probability 1, that is, the probability that individual 1 takes any of the modalities.

The smallest rectangle of the projection of individual 2 (Ind-2) is bigger than the smallest rectangle of the projection of individual 1 (see Figure 4). This is consistent with the interpretation, because the probability that individual 2 takes the modality 1 for the variable 1 is 0.7 and the probability that individual 2 takes the modality 1 for the variable 2 is 0.8, i.e. the mean probability of taking the modality 1 is 0.75. This value is bigger than the same value for individual 1, which is 0.15. That is why the smallest rectangle of the projection of Ind-1 is smaller than the smallest rectangle of the projection of Ind-2. For the same reasons, the second rectangle of the projection of Ind-1 is smaller than the second rectangle of the projection of Ind-2.

**References**

1. Bock H-H. and Diday E. (eds.): Analysis of Symbolic Data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Heidelberg, 425 pages, ISBN 3-540-66619-2, 2000.
2. Cazes P., Chouakria A., Diday E. et Schektman Y.: Extension de l'analyse en composantes principales à des données de type intervalle. Rev. Statistique Appliquée, Vol. XLV, Num. 3, pag. 5-24, Francia, 1997.


# p. 8

Fig. 4. Histogram principal component plane.

3. Chouakria A.: Extension des méthodes d'analyse factorielle à des données de type intervalle. Thèse de doctorat, Université Paris IX Dauphine, 1998.
4. Diday E.: Introduction à l'approche symbolique en Analyse des Données. Première Journées Symbolique-Numérique, Université Paris IX Dauphine, Décembre 1987.
5. Diday E.: L'Analyse des Données Symboliques: un cadre théorique et des outils. Cahiers du CEREMADE, 1998.
6. Rodríguez R.: Classification et Modèles Linéaires en Analyse des Données Symboliques. Thèse de doctorat, Université Paris IX Dauphine, 2000.
