File Import & Export in R

Author: Team BioSakshat

Last update: June 2017

Input files

Input_1.txt
Input_2.txt
Input_3.txt
Input_4.txt
Input_3.xlsx
1BUW.pdb

Read tabular data using read.table()

If data is well structured in tabular form, we can use read.table() to read the data.

Read data

In file Input_1.txt all rows have an equal number of columns, and the cells are separated by tabs. Try ?read.table to check the default values of the arguments.

Defaults: header=FALSE, sep="", stringsAsFactors=TRUE (FALSE since R 4.0.0)

in1 = read.table("_site/data/Day1/Input_1.txt");
in1;

##             V1          V2           V3          V4      V5
## 1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2          5.1         3.5          1.4         0.2  setosa
## 3          4.9           3          1.4         0.2  setosa
## 4          4.7         3.2          1.3         0.2  setosa
## 5          4.6         3.1          1.5         0.2  setosa
## 6            5         3.6          1.4         0.2  setosa
## 7          5.4         3.9          1.7         0.4  setosa

str(in1);

## 'data.frame':    7 obs. of  5 variables:
##  $ V1: Factor w/ 7 levels "4.6","4.7","4.9",..: 7 5 3 2 1 4 6
##  $ V2: Factor w/ 7 levels "3","3.1","3.2",..: 7 4 1 3 2 5 6
##  $ V3: Factor w/ 5 levels "1.3","1.4","1.5",..: 5 2 2 1 3 2 4
##  $ V4: Factor w/ 3 levels "0.2","0.4","Petal.Width": 3 1 1 1 1 1 2
##  $ V5: Factor w/ 2 levels "setosa","Species": 2 1 1 1 1 1 1

We can see that the header has been read as the first data row, which is not what we want.

header=TRUE

in2 = read.table("_site/data/Day1/Input_1.txt", header = TRUE);
in2;

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(in2);

## 'data.frame':    6 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4
##  $ Species     : Factor w/ 1 level "setosa": 1 1 1 1 1 1

header=TRUE makes read.table() use the first row of the file as the column names.

Note the structure of in2. The first four columns are numeric, as expected, but the Species column has been read as a factor (categorical variable). If we do not want character data to be read as factors, we can set stringsAsFactors = FALSE, as shown below.

stringsAsFactors = FALSE

in3 = read.table("_site/data/Day1/Input_1.txt", header = TRUE, stringsAsFactors = FALSE);
in3;

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(in3);

## 'data.frame':    6 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4
##  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

stringsAsFactors=FALSE disables the conversion of character columns to factors. Check the data type of Species (chr): the first four columns are still numeric, and the Species column is now a character vector.
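The effect of header and stringsAsFactors can be reproduced without the course files. The following is a minimal sketch that assumes a small, made-up tab-separated file written to a temporary location; the file contents and object names are illustrative and are not taken from Input_1.txt.

```r
# Hypothetical sample data written to a temporary file (not Input_1.txt itself)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Sepal.Length\tSpecies",
             "5.1\tsetosa",
             "4.9\tsetosa"), tmp)

# Without header = TRUE the column names are read as the first data row
no_header <- read.table(tmp, sep = "\t", stringsAsFactors = FALSE)

# With header = TRUE the first row supplies the column names
with_header <- read.table(tmp, sep = "\t", header = TRUE,
                          stringsAsFactors = FALSE)

# stringsAsFactors = TRUE converts character columns to factors
as_factor <- read.table(tmp, sep = "\t", header = TRUE,
                        stringsAsFactors = TRUE)

nrow(no_header)             # 3: the header line is counted as data
nrow(with_header)           # 2
class(with_header$Species)  # "character"
class(as_factor$Species)    # "factor"
```

Passing stringsAsFactors explicitly, as done here, keeps the result the same regardless of the R version's default.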

Other useful parameters:

sep = ""
comment.char = "#"
na.strings = "NA"
quote = "\"'"
row.names
col.names
blank.lines.skip = TRUE
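As a rough illustration of how these parameters combine, the sketch below reads a made-up semicolon-separated file with no header line. The file contents and the column names id, length and species are assumptions chosen only for this example.

```r
# Hypothetical semicolon-separated file with a comment line and a blank line
tmp <- tempfile(fileext = ".txt")
writeLines(c("# measurements exported from an instrument",
             "s1;5.1;'Iris setosa'",
             "",
             "s2;NA;'Iris setosa'"), tmp)

dat <- read.table(tmp,
                  sep = ";",                # field separator
                  comment.char = "#",       # ignore text after '#'
                  na.strings = "NA",        # strings treated as missing values
                  quote = "\"'",            # both quote characters delimit strings
                  col.names = c("id", "length", "species"),
                  row.names = "id",         # use the 'id' column as row names
                  blank.lines.skip = TRUE,  # ignore empty lines
                  stringsAsFactors = FALSE)

dat
str(dat)
```

Because the species values are quoted, the space inside "Iris setosa" is kept as part of a single field.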

Read file with unequal columns in first row using read.table()

Note that in file Input_2.txt the first row has 5 fields while the remaining rows have 6, i.e. the first row has one field fewer than the other rows. For this format read.table() automatically sets header to TRUE and uses the first column as row names. So in the code below we did not specify header=TRUE (it is optional here).

in4 = read.table("_site/data/Day1/Input_2.txt", stringsAsFactors = FALSE);
in4;

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(in4);

## 'data.frame':    6 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4
##  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
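The automatic header detection can be checked on a small example. This sketch assumes a made-up space-separated file in which the first line has one field fewer than the data lines; the row labels r1 and r2 are hypothetical.

```r
# Hypothetical file: header line has 2 fields, data lines have 3
tmp <- tempfile(fileext = ".txt")
writeLines(c("Sepal.Length Species",
             "r1 5.1 setosa",
             "r2 4.9 setosa"), tmp)

# header is not specified; read.table() infers it from the field counts
auto <- read.table(tmp, stringsAsFactors = FALSE)

names(auto)     # "Sepal.Length" "Species"
rownames(auto)  # "r1" "r2"
```

The extra first field of each data line ends up as the row name rather than as a data column.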

Read file consisting of comments, blank lines, null values

Default arguments: skip=0, comment.char="#", na.strings="NA"

Please note that file Input_3.txt consists of:

First two lines inserted by the author that are not required for processing the data.
2 comment lines starting with the character "!"
1 blank line
Row 3 has one missing value, shown as NULL
The first row has 5 columns while the remaining rows have 6 columns. For this input format header is set to TRUE by read.table().

in5 = read.table("_site/data/Day1/Input_3.txt", stringsAsFactors = FALSE, comment.char = "!", na.strings = "NULL", skip=2);
in5;

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3           NA         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(in5);

## 'data.frame':    6 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 NA 4.6 5 5.4
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4
##  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

skip=2 skips the first two lines, which are not required for processing the data.
comment.char="!" instructs R to ignore lines starting with "!". Default: comment.char="#".
na.strings="NULL" makes R interpret NULL values as missing (NA), which makes it possible to remove such rows with the na.omit() function. Default: na.strings="NA".
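The three arguments described above, followed by na.omit(), can be tried on a small stand-in file. This is a sketch under the assumption of a made-up file laid out roughly like the description of Input_3.txt; it is not the actual course file.

```r
# Hypothetical file: two free-text lines, a '!' comment, a blank line, one NULL
tmp <- tempfile(fileext = ".txt")
writeLines(c("Exported by the author",
             "Not part of the data",
             "! comment line",
             "Sepal.Length Species",
             "",
             "5.1 setosa",
             "NULL setosa",
             "4.6 setosa"), tmp)

dat <- read.table(tmp, header = TRUE,
                  skip = 2,              # drop the two free-text lines
                  comment.char = "!",    # ignore lines starting with '!'
                  na.strings = "NULL",   # read the string NULL as NA
                  stringsAsFactors = FALSE)

str(dat)               # Sepal.Length is numeric and contains one NA
clean <- na.omit(dat)  # remove rows that contain missing values
nrow(clean)            # 2
```

Note that na.strings takes the quoted string "NULL"; the unquoted R object NULL would not mark any value as missing.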

Read csv file using read.csv()

Please note that the values in file Input_4.txt are separated by commas (",") and all rows have an equal number of columns. See the help for read.csv to check the default values of the arguments. Defaults: header=TRUE, sep=","

in7 = read.csv("_site/data/Day1/Input_4.txt", stringsAsFactors = FALSE, comment.char = "!", na.strings = "NULL", skip=2);
in7;

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5          NA  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(in7);

## 'data.frame':    6 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7
##  $ Petal.Width : num  0.2 0.2 0.2 NA 0.2 0.4
##  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
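read.csv() is essentially read.table() with different defaults. The sketch below compares the two on a made-up comma-separated string passed through the text argument, so no file is needed; the data values are illustrative only.

```r
# Hypothetical comma-separated data supplied as a string instead of a file
csv_text <- "Sepal.Length,Species\n5.1,setosa\nNULL,setosa"

# read.csv(): header = TRUE and sep = "," are the defaults
via_csv <- read.csv(text = csv_text, na.strings = "NULL",
                    stringsAsFactors = FALSE)

# The same result with read.table() requires both to be set explicitly
via_table <- read.table(text = csv_text, header = TRUE, sep = ",",
                        na.strings = "NULL", stringsAsFactors = FALSE)

all.equal(via_csv, via_table)
str(via_csv)
```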

Reading data from excel file

We can use the gdata package to read an Excel file. Its read.xls() function requires Perl to be installed on the system.

library("gdata");

## Warning: package 'gdata' was built under R version 3.3.3

## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

##

## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

## 
## Attaching package: 'gdata'

## The following objects are masked from 'package:dplyr':
## 
##     combine, first, last

## The following object is masked from 'package:stats':
## 
##     nobs

## The following object is masked from 'package:utils':
## 
##     object.size

## The following object is masked from 'package:base':
## 
##     startsWith

xl=read.xls("_site/data/Day1/Input_3.xlsx", sheet=1);
xl;

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(xl);

## 'data.frame':    6 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4
##  $ Species     : Factor w/ 1 level "setosa": 1 1 1 1 1 1

library(gdata) loads the gdata package, which provides the read.xls() function. The sheet argument selects which sheet of the input Excel file to read.

Read data using readLines()

By default readLines() reads all the lines of a file and returns a character vector; check str(pdb). Setting n=10 reads only the first 10 lines.

pdb=readLines("_site/data/Day1/1BUW.pdb");
str(pdb);

##  chr [1:5484] "HEADER    OXYGEN STORAGE/TRANSPORT                06-SEP-98   1BUW              " ...

pdb=readLines("_site/data/Day1/1BUW.pdb", n=10);
pdb;

##  [1] "HEADER    OXYGEN STORAGE/TRANSPORT                06-SEP-98   1BUW              "
##  [2] "TITLE     CRYSTAL STRUCTURE OF S-NITROSO-NITROSYL HUMAN HEMOGLOBIN A            "
##  [3] "COMPND    MOL_ID: 1;                                                            "
##  [4] "COMPND   2 MOLECULE: PROTEIN (HEMOGLOBIN);                                      "
##  [5] "COMPND   3 CHAIN: A, C;                                                         "
##  [6] "COMPND   4 SYNONYM: S-NITROSO-NITROSYLHB;                                       "
##  [7] "COMPND   5 OTHER_DETAILS: THE SULFHYDRYL GROUPS OF CYSTEINE 93 OF               "
##  [8] "COMPND   6 BETA SUBUNITS ARE S-NITROSYLATED. THE HEME GROUPS ARE                "
##  [9] "COMPND   7 NITROSYLATED.;                                                       "
## [10] "COMPND   8 MOL_ID: 2;                                                           "
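Since readLines() returns a plain character vector, the usual string functions can be applied to the result. The following sketch assumes a few made-up PDB-style records written to a temporary file (not the real 1BUW.pdb) and uses grep() to pick out the COMPND lines.

```r
# Hypothetical PDB-style records, for illustration only
tmp <- tempfile(fileext = ".pdb")
writeLines(c("HEADER    EXAMPLE",
             "TITLE     HYPOTHETICAL RECORDS",
             "COMPND    MOL_ID: 1;",
             "ATOM      1  N   VAL A   1"), tmp)

all_lines <- readLines(tmp)         # every line of the file
first_two <- readLines(tmp, n = 2)  # only the first two lines

# Keep the lines whose record type is COMPND
compnd <- grep("^COMPND", all_lines, value = TRUE)

length(all_lines)  # 4
length(first_two)  # 2
compnd
```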

Read data using clipboard feature

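Read copied text from the system clipboard. The "clipboard" connection is available on Windows; on macOS, pipe("pbpaste") serves the same purpose.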

data=read.table("clipboard");

View data frame using View()

View(in1);

Edit data frame using edit()

# edit() returns the modified copy, so assign the result to keep the changes
in1 = edit(in1);

Write R data frames in a file using write.table()

write.table(in1, file="result.txt", sep="\t", eol="\n", quote=FALSE, row.names=FALSE, append = FALSE);
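A quick way to check what write.table() produced is to read the file back. This sketch assumes a small made-up data frame and a temporary output file rather than in1 and result.txt.

```r
# Hypothetical data frame used only for this round trip
df <- data.frame(Sepal.Length = c(5.1, 4.9),
                 Species = c("setosa", "setosa"),
                 stringsAsFactors = FALSE)

out <- tempfile(fileext = ".txt")

# Tab-separated, no quotes around strings, no row names column
write.table(df, file = out, sep = "\t", quote = FALSE, row.names = FALSE)

readLines(out)  # raw file contents: a header line followed by two data lines

# Read the file back with matching settings
back <- read.table(out, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
all.equal(df, back)
```

Setting row.names = FALSE keeps the header line and the data lines at the same number of fields, so header = TRUE has to be given explicitly when reading the file back.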

Write using cat()

cat("Hello", file="result.txt");
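By default cat() overwrites the target file, so writing to result.txt replaces anything written there earlier. The sketch below, which uses a temporary file as a stand-in, shows the difference between overwriting and append = TRUE.

```r
out <- tempfile(fileext = ".txt")

cat("Hello\n", file = out)                    # creates the file
cat("Hello again\n", file = out)              # overwrites the previous content
cat("appended\n", file = out, append = TRUE)  # adds to the end of the file

readLines(out)  # "Hello again" "appended"
```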