An Introduction to R
Notes on R: A Programming Environment for Data Analysis and Graphics
Version 2.15.1 (2012-06-22)

W. N. Venables, D. M. Smith and the R Core Team
Copyright (c) 1990 W. N. Venables
Copyright (c) 1992 W. N. Venables & D. M. Smith
Copyright (c) 1997 R. Gentleman & R. Ihaka
Copyright (c) 1997, 1998 M. Maechler
Copyright (c) 1997 R Core Team
Copyright (c) 1999-2012 R Core Team

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the R Core Team.

ISBN 3-900051-12-7
Table of Contents

Preface . . . 1

1 Introduction and preliminaries . . . 2
  1.1 The R environment . . . 2
  1.2 Related software and documentation . . . 2
  1.3 R and statistics . . . 2
  1.4 R and the window system . . . 3
  1.5 Using R interactively . . . 3
  1.6 An introductory session . . . 4
  1.7 Getting help with functions and features . . . 4
  1.8 R commands, case sensitivity, etc. . . . 5
  1.9 Recall and correction of previous commands . . . 5
  1.10 Executing commands from or diverting output to a file . . . 6
  1.11 Data permanency and removing objects . . . 6

2 Simple manipulations; numbers and vectors . . . 7
  2.1 Vectors and assignment . . . 7
  2.2 Vector arithmetic . . . 8
  2.3 Generating regular sequences . . . 8
  2.4 Logical vectors . . . 9
  2.5 Missing values . . . 10
  2.6 Character vectors . . . 10
  2.7 Index vectors; selecting and modifying subsets of a data set . . . 11
  2.8 Other types of objects . . . 12

3 Objects, their modes and attributes . . . 13
  3.1 Intrinsic attributes: mode and length . . . 13
  3.2 Changing the length of an object . . . 14
  3.3 Getting and setting attributes . . . 14
  3.4 The class of an object . . . 15

4 Ordered and unordered factors . . . 16
  4.1 A specific example . . . 16
  4.2 The function tapply() and ragged arrays . . . 16
  4.3 Ordered factors . . . 17
5 Arrays and matrices . . . 19
  5.1 Arrays . . . 19
  5.2 Array indexing. Subsections of an array . . . 19
  5.3 Index matrices . . . 20
  5.4 The array() function . . . 21
    5.4.1 Mixed vector and array arithmetic. The recycling rule . . . 21
  5.5 The outer product of two arrays . . . 22
  5.6 Generalized transpose of an array . . . 23
  5.7 Matrix facilities . . . 23
    5.7.1 Matrix multiplication . . . 23
    5.7.2 Linear equations and inversion . . . 24
    5.7.3 Eigenvalues and eigenvectors . . . 24
    5.7.4 Singular value decomposition and determinants . . . 24
    5.7.5 Least squares fitting and the QR decomposition . . . 25
  5.8 Forming partitioned matrices, cbind() and rbind() . . . 25
  5.9 The concatenation function, c(), with arrays . . . 26
  5.10 Frequency tables from factors . . . 26

6 Lists and data frames . . . 28
  6.1 Lists . . . 28
  6.2 Constructing and modifying lists . . . 29
    6.2.1 Concatenating lists . . . 29
  6.3 Data frames . . . 29
    6.3.1 Making data frames . . . 29
    6.3.2 attach() and detach() . . . 30
    6.3.3 Working with data frames . . . 30
    6.3.4 Attaching arbitrary lists . . . 31
    6.3.5 Managing the search path . . . 31

7 Reading data from files . . . 32
  7.1 The read.table() function . . . 32
  7.2 The scan() function . . . 33
  7.3 Accessing builtin datasets . . . 33
    7.3.1 Loading data from other R packages . . . 34
  7.4 Editing data . . . 34

8 Probability distributions . . . 35
  8.1 R as a set of statistical tables . . . 35
  8.2 Examining the distribution of a set of data . . . 36
  8.3 One- and two-sample tests . . . 39

9 Grouping, loops and conditional execution . . . 42
  9.1 Grouped expressions . . . 42
  9.2 Control statements . . . 42
    9.2.1 Conditional execution: if statements . . . 42
    9.2.2 Repetitive execution: for loops, repeat and while . . . 42
10 Writing your own functions . . . 44
  10.1 Simple examples . . . 44
  10.2 Defining new binary operators . . . 45
  10.3 Named arguments and defaults . . . 45
  10.4 The '...' argument . . . 46
  10.5 Assignments within functions . . . 46
  10.6 More advanced examples . . . 46
    10.6.1 Efficiency factors in block designs . . . 46
    10.6.2 Dropping all names in a printed array . . . 47
    10.6.3 Recursive numerical integration . . . 48
  10.7 Scope . . . 48
  10.8 Customizing the environment . . . 50
  10.9 Classes, generic functions and object orientation . . . 51

11 Statistical models in R . . . 54
  11.1 Defining statistical models; formulae . . . 54
    11.1.1 Contrasts . . . 56
  11.2 Linear models . . . 57
  11.3 Generic functions for extracting model information . . . 57
  11.4 Analysis of variance and model comparison . . . 58
    11.4.1 ANOVA tables . . . 59
  11.5 Updating fitted models . . . 59
  11.6 Generalized linear models . . . 60
    11.6.1 Families . . . 60
    11.6.2 The glm() function . . . 61
  11.7 Nonlinear least squares and maximum likelihood models . . . 63
    11.7.1 Least squares . . . 64
    11.7.2 Maximum likelihood . . . 65
  11.8 Some non-standard models . . . 65

12 Graphical procedures . . . 67
  12.1 High-level plotting commands . . . 67
    12.1.1 The plot() function . . . 67
    12.1.2 Displaying multivariate data . . . 68
    12.1.3 Display graphics . . . 68
    12.1.4 Arguments to high-level plotting functions . . . 69
  12.2 Low-level plotting commands . . . 70
    12.2.1 Mathematical annotation . . . 71
    12.2.2 Hershey vector fonts . . . 72
  12.3 Interacting with graphics . . . 72
  12.4 Using graphics parameters . . . 73
    12.4.1 Permanent changes: The par() function . . . 73
    12.4.2 Temporary changes: Arguments to graphics functions . . . 74
  12.5 Graphics parameters list . . . 74
    12.5.1 Graphical elements . . . 75
    12.5.2 Axes and tick marks . . . 76
    12.5.3 Figure margins . . . 76
    12.5.4 Multiple figure environment . . . 78
  12.6 Device drivers . . . 79
    12.6.1 PostScript diagrams for typeset documents . . . 80
    12.6.2 Multiple graphics devices . . . 80
  12.7 Dynamic graphics . . . 81

13 Packages . . . 82
  13.1 Standard packages . . . 82
  13.2 Contributed packages and CRAN . . . 82
  13.3 Namespaces . . . 83

Appendix A A sample session . . . 84

Appendix B Invoking R . . . 88
  B.1 Invoking R from the command line . . . 88
  B.2 Invoking R under Windows . . . 92
  B.3 Invoking R under Mac OS X . . . 93
  B.4 Scripting with R . . . 94

Appendix C The command-line editor . . . 96
  C.1 Preliminaries . . . 96
  C.2 Editing actions . . . 96
  C.3 Command-line editor summary . . . 96

Appendix D Function and variable index . . . 98

Appendix E Concept index . . . 101

Appendix F References . . . 103
Preface

This introduction to R is derived from an original set of notes describing the S and S-Plus environments written in 1990-2 by Bill Venables and David M. Smith when at the University of Adelaide. We have made a number of small changes to reflect differences between the R and S programs, and expanded some of the material.

We would like to extend warm thanks to Bill Venables (and David Smith) for granting permission to distribute this modified version of the notes in this way, and for being a supporter of R from way back.

Comments and corrections are always welcome. Please address email correspondence to r-core@r-project.org.

Suggestions to the reader

Most R novices will start with the introductory session in Appendix A. This should give some familiarity with the style of R sessions and, more importantly, some instant feedback on what actually happens.

Many users will come to R mainly for its graphical facilities. In this case, Chapter 12 [Graphics], page 67, on the graphics facilities can be read at almost any time and need not wait until all the preceding sections have been digested.
1 Introduction and preliminaries

1.1 The R environment

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has
· an effective data handling and storage facility,
· a suite of operators for calculations on arrays, in particular matrices,
· a large, coherent, integrated collection of intermediate tools for data analysis,
· graphical facilities for data analysis and display either directly at the computer or on hardcopy, and
· a well developed, simple and effective programming language (called 'S') which includes conditionals, loops, user defined recursive functions and input and output facilities. (Indeed most of the system supplied functions are themselves written in the S language.)

The term "environment" is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.

1.2 Related software and documentation

R can be regarded as an implementation of the S language which was developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks, and also forms the basis of the S-Plus systems.

The evolution of the S language is characterized by four books by John Chambers and coauthors. For R, the basic reference is The New S Language: A Programming Environment for Data Analysis and Graphics by Richard A. Becker, John M. Chambers and Allan R. Wilks. The new features of the 1991 release of S are covered in Statistical Models in S edited by John M. Chambers and Trevor J. Hastie. The formal methods and classes of the methods package are based on those described in Programming with Data by John M. Chambers. See Appendix F [References], page 103, for precise references.

There are now a number of books which describe how to use R for data analysis and statistics, and documentation for S/S-Plus can typically be used with R, keeping the differences between the S implementations in mind. See Section "What documentation exists for R?" in The R statistical system FAQ.

1.3 R and statistics

Our introduction to the R environment did not mention statistics, yet many people use R as a statistics system. We prefer to think of it as an environment within which many classical and modern statistical techniques have been implemented. A few of these are built into the base R environment, but many are supplied as packages. There are about 25
packages supplied with R (called "standard" and "recommended" packages) and many more are available through the CRAN family of Internet sites (via http://cran.r-project.org) and elsewhere. More details on packages are given later (see Chapter 13 [Packages], page 82).

Most classical statistics and much of the latest methodology are available for use with R, but users may need to be prepared to do a little work to find them.

There is an important difference in philosophy between S (and hence R) and the other main statistical systems. In S a statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Thus whereas SAS and SPSS will give copious output from a regression or discriminant analysis, R will give minimal output and store the results in a fit object for subsequent interrogation by further R functions.

1.4 R and the window system

The most convenient way to use R is at a graphics workstation running a windowing system. This guide is aimed at users who have this facility. In particular we will occasionally refer to the use of R on an X window system, although the vast bulk of what is said applies generally to any implementation of the R environment.

Most users will find it necessary to interact directly with the operating system on their computer from time to time. In this guide, we mainly discuss interaction with the operating system on UNIX machines. If you are running R under Windows or Mac OS you will need to make some small adjustments.

Setting up a workstation to take full advantage of the customizable features of R is a straightforward if somewhat tedious procedure, and will not be considered further here. Users in difficulty should seek local expert help.

1.5 Using R interactively

When you use the R program it issues a prompt when it expects input commands. The default prompt is '>', which on UNIX might be the same as the shell prompt, and so it may appear that nothing is happening. However, as we shall see, it is easy to change to a different R prompt if you wish. We will assume that the UNIX shell prompt is '$'.

In using R under UNIX the suggested procedure for the first occasion is as follows:

1. Create a separate sub-directory, say 'work', to hold data files on which you will use R for this problem. This will be the working directory whenever you use R for this particular problem.
     $ mkdir work
     $ cd work
2. Start the R program with the command
     $ R
3. At this point R commands may be issued (see later).
4. To quit the R program the command is
     > q()
   At this point you will be asked whether you want to save the data from your R session. On some systems this will bring up a dialog box, and on others you will receive a text prompt to which you can respond yes, no or cancel (a single letter abbreviation will
   do) to save the data before quitting, quit without saving, or return to the R session. Data which is saved will be available in future R sessions.

Further R sessions are simple.

1. Make 'work' the working directory and start the program as before:
     $ cd work
     $ R
2. Use the R program, terminating with the q() command at the end of the session.

To use R under Windows the procedure to follow is basically the same. Create a folder as the working directory, and set that in the 'Start In' field in your R shortcut. Then launch R by double clicking on the icon.

1.6 An introductory session

Readers wishing to get a feel for R at a computer before proceeding are strongly advised to work through the introductory session given in Appendix A [A sample session], page 84.

1.7 Getting help with functions and features

R has an inbuilt help facility similar to the man facility of UNIX. To get more information on any specific named function, for example solve, the command is
     > help(solve)
An alternative is
     > ?solve
For a feature specified by special characters, the argument must be enclosed in double or single quotes, making it a "character string". This is also necessary for a few words with syntactic meaning including if, for and function.
     > help("[[")
Either form of quote mark may be used to escape the other, as in the string "It's important". Our convention is to use double quote marks for preference.

On most R installations help is available in HTML format by running
     > help.start()
which will launch a Web browser that allows the help pages to be browsed with hyperlinks. On UNIX, subsequent help requests are sent to the HTML-based help system. The 'Search Engine and Keywords' link in the page loaded by help.start() is particularly useful as it contains a high-level concept list which searches through available functions. It can be a great way to get your bearings quickly and to understand the breadth of what R has to offer.

The help.search command (alternatively ??) allows searching for help in various ways. For example,
     > ??solve
Try ?help.search for details and more examples.

The examples on a help topic can normally be run by
     > example(topic)
Windows versions of R have other optional help systems: use
     > ?help
for further details.

1.8 R commands, case sensitivity, etc.

Technically R is an expression language with a very simple syntax. It is case sensitive as are most UNIX based packages, so A and a are different symbols and would refer to different variables. The set of symbols which can be used in R names depends on the operating system and country within which R is being run (technically on the locale in use). Normally all alphanumeric symbols are allowed(1) (and in some countries this includes accented letters) plus '.' and '_', with the restriction that a name must start with '.' or a letter, and if it starts with '.' the second character must not be a digit. Names are currently effectively unlimited in length, but were limited to 256 bytes prior to R 2.13.0.

Elementary commands consist of either expressions or assignments. If an expression is given as a command, it is evaluated, printed (unless specifically made invisible), and the value is lost. An assignment also evaluates an expression and passes the value to a variable, but the result is not automatically printed.

Commands are separated either by a semi-colon (';') or by a newline. Elementary commands can be grouped together into one compound expression by braces ('{' and '}'). Comments can be put almost(2) anywhere: starting with a hashmark ('#'), everything to the end of the line is a comment.

If a command is not complete at the end of a line, R will give a different prompt, by default
     +
on second and subsequent lines and continue to read input until the command is syntactically complete. This prompt may be changed by the user. We will generally omit the continuation prompt and indicate continuation by simple indenting.

Command lines entered at the console are limited(3) to about 4095 bytes (not characters).

1.9 Recall and correction of previous commands

Under many versions of UNIX and on Windows, R provides a mechanism for recalling and re-executing previous commands. The vertical arrow keys on the keyboard can be used to scroll forward and backward through a command history. Once a command is located in this way, the cursor can be moved within the command using the horizontal arrow keys, and characters can be removed with the DEL key or added with the other keys. More details are provided later: see Appendix C [The command-line editor], page 96.

The recall and editing capabilities under UNIX are highly customizable. You can find out how to do this by reading the manual entry for the readline library.

(1) For portable R code (including that to be used in R packages) only A-Za-z0-9 should be used.
(2) Not inside strings, nor within the argument list of a function definition.
(3) Some of the consoles will not allow you to enter more, and amongst those which do some will silently discard the excess and some will use it as the start of the next line.
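
The syntax rules of Section 1.8 can be tried out in a short session. The following is only an illustrative sketch: the object names (A, a, b, total, my.value_1, .hidden) are invented for the example and do not come from the text.

```r
# Case sensitivity: 'A' and 'a' are distinct variables.
A <- 1
a <- 2
A == a            # an expression: evaluated and printed, then the value is lost

# Two commands on one line, separated by a semi-colon.
b <- 3; b + 1

# Braces group elementary commands into one compound expression;
# its value is the value of the last command inside the braces.
total <- {
  u <- 10
  u + A + a
}
total

# Names may contain '.' and '_', and may start with '.' (not followed by a digit).
my.value_1 <- total
.hidden <- TRUE
```

Note that the assignments print nothing, whereas the bare expressions (A == a, b + 1, total) are printed when run at the console.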
Alternatively, the Emacs text editor provides more general support mechanisms (via ESS, Emacs Speaks Statistics) for working interactively with R. See Section "R and Emacs" in The R statistical system FAQ.

1.10 Executing commands from or diverting output to a file

If commands(4) are stored in an external file, say 'commands.R' in the working directory 'work', they may be executed at any time in an R session with the command
     > source("commands.R")
For Windows, Source is also available on the File menu. The function sink,
     > sink("record.lis")
will divert all subsequent output from the console to an external file, 'record.lis'. The command
     > sink()
restores it to the console once again.

1.11 Data permanency and removing objects

The entities that R creates and manipulates are known as objects. These may be variables, arrays of numbers, character strings, functions, or more general structures built from such components.

During an R session, objects are created and stored by name (we discuss this process in the next section). The R command
     > objects()
(alternatively, ls()) can be used to display the names of (most of) the objects which are currently stored within R. The collection of objects currently stored is called the workspace.

To remove objects the function rm is available:
     > rm(x, y, z, ink, junk, temp, foo, bar)

All objects created during an R session can be stored permanently in a file for use in future R sessions. At the end of each R session you are given the opportunity to save all the currently available objects. If you indicate that you want to do this, the objects are written to a file called '.RData'(5) in the current directory, and the command lines used in the session are saved to a file called '.Rhistory'.

When R is started at a later time from the same directory it reloads the workspace from this file. At the same time the associated commands history is reloaded.

It is recommended that you should use separate working directories for analyses conducted with R. It is quite common for objects with names x and y to be created during an analysis. Names like this are often meaningful in the context of a single analysis, but it can be quite hard to decide what they might be when several analyses have been conducted in the same directory.

(4) Of unlimited length.
(5) The leading "dot" in this file name makes it invisible in normal file listings in UNIX.
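
The commands of Sections 1.10 and 1.11 can be exercised together in one short session. This is a minimal sketch under stated assumptions: temporary files stand in for 'commands.R' and 'record.lis' so that nothing is left behind in the working directory, the object names (script, log_file, p, half_p, captured) are made up for illustration, and the local = TRUE argument is an addition not discussed in the text (it makes source() evaluate in the calling environment, which at top level is the workspace).

```r
# Write two commands to a script file (a stand-in for 'commands.R').
script <- tempfile(fileext = ".R")
writeLines(c("p <- 6 * 7", "half_p <- p / 2"), script)

# source() executes the commands stored in the file.
source(script, local = TRUE)

# sink(file) diverts console output to the file; sink() restores the console.
log_file <- tempfile(fileext = ".lis")
sink(log_file)
print(p)
sink()
captured <- readLines(log_file)   # what would have appeared on the console

# ls() (or objects()) lists the workspace; rm() removes objects by name.
"p" %in% ls()
rm(p, half_p)
"p" %in% ls()
```

The two calls to ls() bracket the rm() call, so the first membership test reports that p is present and the second that it has gone.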
2 Simple manipulations; numbers and vectors

2.1 Vectors and assignment

R operates on named data structures. The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers. To set up a vector named x, say, consisting of five numbers, namely 10.4, 5.6, 3.1, 6.4 and 21.7, use the R command
     > x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
This is an assignment statement using the function c() which in this context can take an arbitrary number of vector arguments and whose value is a vector got by concatenating its arguments end to end.(1)

A number occurring by itself in an expression is taken as a vector of length one.

Notice that the assignment operator ('<-') consists of the two characters '<' ("less than") and '-' ("minus") occurring strictly side-by-side, and that it 'points' to the object receiving the value of the expression. In most contexts the '=' operator can be used as an alternative.

Assignment can also be made using the function assign(). An equivalent way of making the same assignment as above is with:
     > assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
The usual operator, <-, can be thought of as a syntactic short-cut to this.

Assignments can also be made in the other direction, using the obvious change in the assignment operator. So the same assignment could be made using
     > c(10.4, 5.6, 3.1, 6.4, 21.7) -> x

If an expression is used as a complete command, the value is printed and lost.(2) So now if we were to use the command
     > 1/x
the reciprocals of the five values would be printed at the terminal (and the value of x, of course, would be unchanged).

The further assignment
     > y <- c(x, 0, x)
would create a vector y with 11 entries consisting of two copies of x with a zero in the middle place.

(1) With other than vector types of argument, such as list mode arguments, the action of c() is rather different. See Section 6.2.1 [Concatenating lists], page 29.
(2) Actually, it is still available as .Last.value before any other statements are executed.
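
As a quick check that the assignment forms described in Section 2.1 are interchangeable, the sketch below builds the same vector four ways and compares the results. The names x2, x3 and x4 are introduced only for this illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)            # usual left-pointing form
assign("x2", c(10.4, 5.6, 3.1, 6.4, 21.7))   # function form
c(10.4, 5.6, 3.1, 6.4, 21.7) -> x3           # right-pointing form
x4 = c(10.4, 5.6, 3.1, 6.4, 21.7)            # '=' as an alternative

# All four objects hold exactly the same vector.
identical(x, x2) && identical(x, x3) && identical(x, x4)

# Concatenating vectors: two copies of x with a zero between them.
y <- c(x, 0, x)
length(y)    # 5 + 1 + 5 entries
y[6]         # the zero in the middle place
```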
2.2 Vector arithmetic

Vectors can be used in arithmetic expressions, in which case the operations are performed element by element. Vectors occurring in the same expression need not all be of the same length. If they are not, the value of the expression is a vector with the same length as the longest vector which occurs in the expression. Shorter vectors in the expression are recycled as often as need be (perhaps fractionally) until they match the length of the longest vector. In particular a constant is simply repeated. So with the above assignments the command
     > v <- 2*x + y + 1
generates a new vector v of length 11 constructed by adding together, element by element, 2*x repeated 2.2 times, y repeated just once, and 1 repeated 11 times.

The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a power. In addition all of the common arithmetic functions are available. log, exp, sin, cos, tan, sqrt, and so on, all have their usual meaning. max and min select the largest and smallest elements of a vector respectively. range is a function whose value is a vector of length two, namely c(min(x), max(x)). length(x) is the number of elements in x, sum(x) gives the total of the elements in x, and prod(x) their product.

Two statistical functions are mean(x), which calculates the sample mean and is the same as sum(x)/length(x), and var(x), which gives
     sum((x-mean(x))^2)/(length(x)-1)
or sample variance. If the argument to var() is an n-by-p matrix the value is a p-by-p sample covariance matrix got by regarding the rows as independent p-variate sample vectors.

sort(x) returns a vector of the same size as x with the elements arranged in increasing order; however there are other more flexible sorting facilities available (see order() or sort.list() which produce a permutation to do the sorting).

Note that max and min select the largest and smallest values in their arguments, even if they are given several vectors. The parallel maximum and minimum functions pmax and pmin return a vector (of length equal to their longest argument) that contains in each element the largest (smallest) element in that position in any of the input vectors.

For most purposes the user will not be concerned if the "numbers" in a numeric vector are integers, reals or even complex. Internally calculations are done as double precision real numbers, or double precision complex numbers if the input data are complex.

To work with complex numbers, supply an explicit complex part. Thus
     sqrt(-17)
will give NaN and a warning, but
     sqrt(-17+0i)
will do the computations as complex numbers.

2.3 Generating regular sequences

R has a number of facilities for generating commonly used sequences of numbers. For example 1:30 is the vector c(1, 2, ..., 29, 30). The colon operator has high priority within an expression, so, for example 2*1:15 is the vector c(2, 4, ..., 28, 30). Put n <- 10 and compare the sequences 1:n-1 and 1:(n-1).

The construction 30:1 may be used to generate a sequence backwards.
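
The recycling rule, the variance formula and the priority of the colon operator are easy to confirm directly. The sketch below reuses x and y from Section 2.1; manual_var is a name invented here, and suppressWarnings() is used only because R warns when the longer length (11) is not a multiple of the shorter one (5).

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
y <- c(x, 0, x)

# Recycling: 2*x (length 5) is reused to fill out length 11.
v <- suppressWarnings(2*x + y + 1)
length(v)

# var() agrees with the explicit sample-variance formula.
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
all.equal(var(x), manual_var)

# max() pools all of its arguments; pmax() works position by position.
max(c(1, 5, 3), c(4, 2, 6))     # 6
pmax(c(1, 5, 3), c(4, 2, 6))    # 4 5 6

# An explicit complex part makes sqrt() compute in complex arithmetic.
z <- sqrt(-17 + 0i)
is.complex(z)

# The colon operator is applied before the arithmetic operators.
n <- 10
1:n-1      # (1:n) - 1, that is 0, 1, ..., 9
1:(n-1)    # 1, 2, ..., 9
2*1:15     # 2, 4, ..., 30
```

The comparison of 1:n-1 with 1:(n-1) is the exercise suggested in Section 2.3: without parentheses the subtraction is applied to the whole sequence.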
The function seq() is a more general facility for generating sequences. It has five arguments, only some of which may be specified in any one call. The first two arguments, if given, specify the beginning and end of the sequence, and if these are the only two arguments given the result is the same as the colon operator. That is, seq(2,10) is the same vector as 2:10.

Parameters to seq(), and to many other R functions, can also be given in named form, in which case the order in which they appear is irrelevant. The first two parameters may be named from=value and to=value; thus seq(1,30), seq(from=1, to=30) and seq(to=30, from=1) are all the same as 1:30. The next two parameters to seq() may be named by=value and length=value, which specify a step size and a length for the sequence respectively. If neither of these is given, the default by=1 is assumed.

For example
     > seq(-5, 5, by=.2) -> s3
generates in s3 the vector c(-5.0, -4.8, -4.6, ..., 4.6, 4.8, 5.0). Similarly
     > s4 <- seq(length=51, from=-5, by=.2)
generates the same vector in s4.

The fifth parameter may be named along=vector, which if used must be the only parameter, and creates a sequence 1, 2, ..., length(vector), or the empty sequence if the vector is empty (as it can be).

A related function is rep() which can be used for replicating an object in various complicated ways. The simplest form is
     > s5 <- rep(x, times=5)
which will put five copies of x end-to-end in s5. Another useful version is
     > s6 <- rep(x, each=5)
which repeats each element of x five times before moving on to the next.

2.4 Logical vectors

As well as numerical vectors, R allows manipulation of logical quantities. The elements of a logical vector can have the values TRUE, FALSE, and NA (for "not available", see below). The first two are often abbreviated as T and F, respectively. Note however that T and F are just variables which are set to TRUE and FALSE by default, but are not reserved words and hence can be overwritten by the user. Hence, you should always use TRUE and FALSE.

Logical vectors are generated by conditions. For example
     > temp <- x > 13
sets temp as a vector of the same length as x with values FALSE corresponding to elements of x where the condition is not met and TRUE where it is.

The logical operators are <, <=, >, >=, == for exact equality and != for inequality. In addition if c1 and c2 are logical expressions, then c1 & c2 is their intersection ("and"), c1 | c2 is their union ("or"), and !c1 is the negation of c1.

Logical vectors may be used in ordinary arithmetic, in which case they are coerced into numeric vectors, FALSE becoming 0 and TRUE becoming 1. However there are situations where logical vectors and their coerced numeric counterparts are not equivalent; for an example see the next subsection.
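
The following sketch pulls together seq(), rep() and logical vectors from Sections 2.3 and 2.4, using the vector x from Section 2.1. It is illustrative only: rep() is called with 2 rather than 5 copies to keep the results short, and the names in_range and n_big are invented for the example.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# Positional and named calls to seq() give the same sequences.
all(seq(2, 10) == 2:10)
all(seq(to = 30, from = 1) == 1:30)

# Two ways of generating -5.0, -4.8, ..., 4.8, 5.0.
s3 <- seq(-5, 5, by = 0.2)
s4 <- seq(length = 51, from = -5, by = 0.2)
all.equal(s3, s4)

# along= gives the index sequence 1, 2, ..., length(x).
seq(along = x)

# rep(): whole-vector copies versus element-wise repetition.
s5 <- rep(x, times = 2)   # x followed by x
s6 <- rep(x, each = 2)    # each element twice in succession

# Logical vectors come from conditions and combine with & and |.
temp <- x > 13
temp                          # FALSE FALSE FALSE FALSE TRUE
in_range <- (x > 5) & (x < 11)

# In arithmetic, FALSE counts as 0 and TRUE as 1.
n_big <- sum(temp)            # number of elements greater than 13
```

Summing a logical vector, as in the last line, is a common way of counting how many elements satisfy a condition.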