Learning Objectives
- Learn how to use the reticulate R package to work with Python in R.
- Learn basic Python syntax and how it compares to R.
- Learn how to write functions, conditional statements, and loops in Python.
- Learn how to use Python methods.
- Learn basic string manipulation in Python.
- Learn how to run a Python script from R.
Suggested readings
- The reticulate R package documentation.
Python is another very popular computing language for data analysis and general purpose computing. Since R is the main language for this course, we will not cover all the many wondrous things that Python can do. Instead, we will introduce Python through the lens of how it is used for data analysis, with a particular focus on comparing its similarities and differences with R.
While you can work with Python in a number of ways, we will use the reticulate package to access it directly from R!
To get started, install the package (remember, you only need to do this once on your computer):
install.packages('reticulate')
Once installed, load the package:
library(reticulate)
If you already have Python installed on your computer, you should be okay, but you may see the following message pop up in the console:
Would you like to install Miniconda? [Y/n]:
If so, I recommend you go ahead and install Miniconda by typing y and pressing enter. Miniconda is a smaller version of the larger “Anaconda” distribution that many people use to install Python, and it is the preferred setup for using Python in R.
Once you’ve loaded the reticulate library, use the following command to open up a Python REPL (which stands for “Read-Eval-Print Loop”):
repl_python()
Now look at your console - you should see a prompt made of three > symbols (>>>). This means you’re now using Python! (Remember, the R console has only one > symbol).
Check your Python version!
Above the >>> prompt, you should see a message indicating which version of Python you are using. It should say “Python 3….”. Python has two major versions (2 and 3) - we’ll be using Python 3. If you see Python 2, then you’ll need to adjust your configuration to use Python 3. If you installed Miniconda, this should be Python 3.
If you want to get back to good ol’ R, just type the command exit into the Python console:
exit
Note that you should type exit and not exit() with parentheses.
Python has the same kinds of arithmetic (+, -, *, /), relational (<, >, ==), and logical operators as R, but some of the symbols are a little different. Here’s a quick comparison of these differences:
Arithmetic operators | R | Python |
---|---|---|
Integer division | %/% | // |
Modulus | %% | % |
Powers | ^ | ** |
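As a quick check of the arithmetic operators in the table, here is a small sketch you can paste into the Python REPL. The variable names are only for illustration. It also shows one pitfall for R users: the ^ symbol exists in Python, but it is a bitwise operator, not a power.

```python
# Integer division, modulus, and powers (R: %/%, %%, ^)
quotient = 7 // 2      # 3
remainder = 7 % 2      # 1
power = 2 ** 3         # 8

# Regular division always returns a float, even when the result is whole
ratio = 6 / 3          # 2.0

# Pitfall: ^ is bitwise XOR in Python, not a power operator
not_a_power = 2 ^ 3    # 1

print(quotient, remainder, power, ratio, not_a_power)
```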
Logical operators | R | Python |
---|---|---|
And | & | & or and |
Or | \| | \| or or |
Not | ! | not |
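Python also accepts the symbols & and | between logical values, but there is no ! operator for negation in Python (only != for “not equal”). The usual way to write logical statements in Python is with the English words and, or, and not. Because & and | bind more tightly than comparisons like ==, wrap each comparison in parentheses when you use them. For example, the following statements will both return True: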
(3 == 3) & (4 == 4)
## True
(3 == 3) and (4 == 4)
## True
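The parentheses in the & example matter more than they might appear to. The sketch below (variable names are illustrative) shows what happens when they are left out: Python evaluates 3 & 4 first, so the whole statement means something different. The English words and, or, and not avoid this problem.

```python
with_words = 3 == 3 and 4 == 4        # True
with_parens = (3 == 3) & (4 == 4)     # True

# Without parentheses, & is evaluated before ==,
# so this is read as 3 == (3 & 4) == 4, which is False
without_parens = 3 == 3 & 4 == 4

# Negation uses the word not (there is no ! operator)
negated = not (3 == 4)                # True

print(with_words, with_parens, without_parens, negated)
```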
While in R you can use either = or <- to assign values to objects, in Python only the = symbol can be used:
value = 3
value
## 3
For the most part, Python has the same data types as R: “numeric”, “string”, and “logical”. But the two languages use different words to describe them:
Description | R | Python |
---|---|---|
numeric (w/decimal) | "double" | "float" |
integer | "integer" | "int" |
character | "character" | "str" |
logical | "logical" | "bool" |
There are three important distinctions between the languages on data types:
- In R, we use TRUE and FALSE (in all caps) to denote logical statements that are “True” or “False”, but in Python you only capitalize the first letter: True or False.
- In R, numbers are doubles by default, so even whole numbers (e.g. 3) are technically floats. In Python, numbers are integers by default unless they have decimal values (e.g. 3 is an int type, but 3.14 is a float type).
- In R, a missing or undefined value is denoted by NULL, but in Python we use None.
You can check the type using typeof() in R or type() in Python:
R:
typeof(3.14)
## [1] "double"
typeof(3L)
## [1] "integer"
typeof("3")
## [1] "character"
typeof(TRUE)
## [1] "logical"
Python:
type(3.14)
## <class 'float'>
type(3)
## <class 'int'>
type("3")
## <class 'str'>
type(True)
## <class 'bool'>
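The int and float distinction shows up as soon as you do arithmetic. Here is a short sketch; isinstance() is a built-in Python function that is not otherwise covered in these notes, and the variable names are only for illustration.

```python
# Regular division gives a float; integer division of two ints gives an int
print(type(7 / 2))     # <class 'float'>
print(type(7 // 2))    # <class 'int'>

# None has its own type
print(type(None))      # <class 'NoneType'>

# isinstance() checks a value against a type and returns True or False
whole_is_int = isinstance(3, int)          # True
whole_is_float = isinstance(3, float)      # False
decimal_is_float = isinstance(3.0, float)  # True
```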
In R, you can convert data types using the general form of as.something(), replacing “something” with a data type. In Python, you can simply use the data type name to convert types. Here’s a comparison:
Convert to double / float:
R:
as.double(3)
## [1] 3
Python:
float(3)
## 3.0
Convert to integer:
R:
as.integer(3.14)
## [1] 3
Python:
int(3.14)
## 3
Convert to string:
R:
as.character(3.14)
## [1] "3.14"
Python:
str(3.14)
## '3.14'
Convert to logical:
R:
as.logical(3.14)
## [1] TRUE
Python:
bool(3.14)
## True
Remember that any number other than 0 converts to TRUE (True in Python), while 0 converts to FALSE (False in Python).
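A few conversion edge cases are worth trying yourself. The sketch below uses illustrative variable names. It shows that int() drops the decimal part rather than rounding, that int() will not parse a string with a decimal point directly, and that any non-empty string converts to True, even the text "False".

```python
# int() truncates toward zero; it does not round
truncated = int(-3.99)              # -3

# Strings holding whole numbers convert directly
from_text = int("3")                # 3

# A string with a decimal point raises a ValueError in int()
try:
    int("3.14")
    parsed_directly = True
except ValueError:
    parsed_directly = False

# Going through float() first works
via_float = int(float("3.14"))      # 3

# For strings, only the empty string converts to False
empty_is_false = bool("")           # False
text_is_true = bool("False")        # True
```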
Perhaps the biggest syntax difference between R and Python is that Python uses white space to define blocks of code.
For example, to write a loop in Python, you have to indent the lines inside the loop (by convention, four spaces), otherwise you’ll get an error. The benefit of this is that it forces you to use good style practices, and you don’t have to use the {} symbols like you do in R. The downside is that if a line’s indentation is off by even a single space, you’ll get an error, and sometimes that’s hard to notice.
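To see the white space rule in action without crashing your session, the sketch below stores two versions of the same loop as text and asks Python whether each one is valid. The helper name compiles() is hypothetical; compile() and IndentationError are built into Python.

```python
# The same loop, with and without the indented second line
good = "for i in [1, 3, 5]:\n    print(i)\n"
bad = "for i in [1, 3, 5]:\nprint(i)\n"

def compiles(source):
    """Return True if Python can parse the source text."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False

print(compiles(good))   # True
print(compiles(bad))    # False
```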
Here’s a comparison of loops in R and Python:
for loop:
R:
for (i in c(1,3,5)) {
    print(i)
}
## [1] 1
## [1] 3
## [1] 5
Python:
for i in [1,3,5]:
    print(i)
## 1
## 3
## 5
while loop:
R:
i <- 1
while (i <= 5) {
    print(i)
    i <- i + 2
}
## [1] 1
## [1] 3
## [1] 5
Python:
i = 1
while i <= 5:
    print(i)
    i = i + 2
## 1
## 3
## 5
One of the things many people love about Python is just how “clean” the syntax looks. Compared to R, the Python code above is more compact and contains fewer distracting elements, like the “{}” symbols. You also don’t need to include () symbols on the first line.
Other than these differences in syntax, loops are essentially the same across the two languages.
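As a small extension of the loops above, here is a sketch that collects results instead of printing them. It uses the built-in range() function, which is not covered in the text; range(1, 6, 2) produces 1, 3, 5, much like seq(1, 5, by = 2) in R. The variable names are only for illustration.

```python
# for loop: collect the values 1, 3, 5 in a list
values = []
for i in range(1, 6, 2):    # start at 1, stop before 6, step by 2
    values.append(i)

# while loop: add up the same values
total = 0
i = 1
while i <= 5:
    total = total + i
    i = i + 2

print(values)   # [1, 3, 5]
print(total)    # 9
```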
Functions use the same “spacing” format as loops, and again the Python syntax is more compact. Here’s a comparison of the isEven(n) function:
R:
isEven <- function(n) {
    if (n %% 2 == 0) {
        return(TRUE)
    }
    return(FALSE)
}
Python:
def isEven(n):
    if n % 2 == 0:
        return True
    return False
Note the difference in the ordering of the first lines. In R, you first define the function name, then you assign to that name the function and argument(s).
In Python, you do not use any assignment to create a function. Rather, you use the keyword def followed by the function name and argument(s). Here, the Python syntax is quite natural - you use the same syntax that you would use when calling the function (e.g. isEven(n)).
Note also that the if statement in Python uses the same general syntax of indented white space instead of using the {} symbols.
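When there are more than two cases, Python uses elif and else, where R uses else if and else. Here is a minimal sketch with a hypothetical function named describe(); each branch is marked by a colon and an indented body.

```python
def describe(n):
    """Hypothetical example: label a whole number as zero, even, or odd."""
    if n == 0:
        return "zero"
    elif n % 2 == 0:
        return "even"
    else:
        return "odd"

print(describe(0))    # zero
print(describe(4))    # even
print(describe(7))    # odd
```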
You might have heard people (i.e. me) say that Python is more “object-oriented” whereas R is more “functional.” What I mean is that in R you mostly apply functions to objects, but in Python you often call special functions that belong to certain object types. Here’s an example of converting a string to upper case:
R: We use the string "foo" as an argument to the str_to_upper() function from the stringr library, which returns "FOO".
stringr::str_to_upper("foo")
## [1] "FOO"
Python: We use the .upper() method that belongs to the string "foo", which returns "FOO". All strings in Python have this method.
"foo".upper()
## 'FOO'
Methods are special functions that belong to objects of a certain class. You “call” methods using the name of the object followed by the . symbol, like this:
object.method()
You can also see the different methods available for a particular object by calling the dir() function on the object:
s = "foo"
dir(s)
## ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
Wow, strings have a lot of methods!
The concept of using methods is a major part of the “object-oriented” way of programming, since it’s the object that is the center of attention. The object in Python is more than just a stored value - it’s a source of other methods (depending on the object’s class).
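The long dir() output is easier to read if you drop the names that start with underscores. The sketch below does that with a list comprehension (a compact way of writing a loop that is not covered in these notes), and then shows that methods can be chained, since each method returns a new string. The variable names are only for illustration.

```python
s = "foo"

# Keep only the method names that do not start with an underscore
public = [name for name in dir(s) if not name.startswith("_")]
print("upper" in public)      # True
print("__add__" in public)    # False

# Methods can be chained: strip the spaces, then convert to upper case
shout = "  foo ".strip().upper()
print(shout)                  # FOO
```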
Now that you’ve seen a little about how Python methods work, we’ll get to use some while working with strings!
String manipulation is one area where more substantial differences emerge between Python and R. Because R’s built-in functions for dealing with strings are rather unintuitive, we’ve relied on the stringr package.
In Python, many of the basic string manipulations are actually done with basic arithmetic operators, just like with numbers. Here are a few comparisons:
String concatenation:
In R, we use the function paste() to combine strings:
paste("foo", "bar", sep = "")
## [1] "foobar"
In Python, you can combine strings by “adding” them together. The default is to merge them with no space in between:
"foo" + "bar"
## 'foobar'
String repetition:
Creating a repeated string is even more complicated in R. You first have to create a vector of repeated strings, and then “collapse” them using the paste() function:
paste(rep("foo", 3), collapse = '')
## [1] "foofoofoo"
In Python, you can just “multiply” the string, like this:
"foo" * 3
## 'foofoofoo'
Sub-string detection:
In R, we use the str_detect() function from stringr:
str_detect('Apple', 'ppl')
## [1] TRUE
In Python, you can detect sub-strings using the in operator:
'ppl' in 'Apple'
## True
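A few details about these operators are easy to trip over. The sketch below uses illustrative variable names. It shows that + only works between two strings, that the .join() method is closer to paste() with a collapse argument when you have a list of strings, and that the in operator is case-sensitive.

```python
# + needs strings on both sides; a number must be converted with str() first
try:
    "foo" + 3
    mixed_ok = True
except TypeError:
    mixed_ok = False
label = "foo" + str(3)                    # 'foo3'

# .join() combines a list of strings using a separator
words = ["foo", "bar", "baz"]
spaced = " ".join(words)                  # 'foo bar baz'
glued = "".join(words)                    # 'foobarbaz'

# The in operator is case-sensitive
exact = "PPL" in "Apple"                  # False
ignoring_case = "ppl" in "APPLE".lower()  # True
```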
Because Python has both functions and object methods, it can sometimes be tricky to remember which to use for a specific purpose. For example, if you want to know how many characters are in a string, you use a function, just like in R:
String length:
R:
str_length('foo')
## [1] 3
Python:
len('foo')
## 3
However, lots of basic string manipulations are done with string methods:
Case conversion:
R:
s <- "A longer string"
str_to_upper(s)
## [1] "A LONGER STRING"
str_to_lower(s)
## [1] "a longer string"
str_to_title(s)
## [1] "A Longer String"
Python:
s = "A longer string"
s.upper()
## 'A LONGER STRING'
s.lower()
## 'a longer string'
s.title()
## 'A Longer String'
Remove excess white space:
R:
s <- " A string with space "
str_trim(s)
## [1] "A string with space"
Python:
s = " A string with space "
s.strip()
## 'A string with space'
Detect if string contains only numbers:
R doesn’t have a dedicated function for this, but you can convert the string to a number and check if the result is not NA:
s <- "42"
!is.na(as.numeric(s))
## [1] TRUE
Python has some handy string methods!
s = "42"
s.isnumeric()
## True
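One caveat: .isnumeric() checks whether every character is a numeric character, so it returns False for strings with a decimal point or a minus sign. If you want something closer to the R approach above, one option is to try the conversion and see if it fails. The helper name looks_like_number() is hypothetical, and this is only a sketch.

```python
def looks_like_number(text):
    """Hypothetical helper: True if float() can parse the text."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# .isnumeric() only accepts strings made entirely of numeric characters
checks = {text: text.isnumeric() for text in ["42", "4.2", "-42", ""]}
print(checks)
# {'42': True, '4.2': False, '-42': False, '': False}

print(looks_like_number("-4.2"))        # True
print(looks_like_number("forty-two"))   # False
```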
To extract a sub-string, in R we have to use the str_sub() function. But in Python, you can simply use the [] symbols. In either case, you have to provide indices of where to start and stop the “slice”.
For example, here’s how to extract the sub-string "App" from "Apple" in each language:
R:
s <- "Apple"
str_sub(s, 1, 3)
## [1] "App"
Python:
s = "Apple"
s[0:3]
## 'App'
Note that we had to use a different starting index here to get the same sub-string in each language. That’s because indexing starts at 0 in Python.
If this seems strange, just imagine “fence posts”. In Python, the elements in a sequence are like items sitting between fence posts. So the indices of the characters in the string "Apple" look like this:
index:  0     1     2     3     4     5
        |     |     |     |     |     |
        | "A" | "p" | "p" | "l" | "e" |
        |     |     |     |     |     |
When you make a slice in Python, you slice at the fence post number to get the elements between the posts. So in this case, if we want to get the sub-string "App" from "Apple", we need to slice from the post 0 to 3.
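The fence post picture also explains a few slicing shortcuts. In the sketch below (variable names are illustrative), leaving out the start means “from the first post” and leaving out the end means “to the last post”. The length of a slice is the difference between the two post numbers.

```python
s = "Apple"

first_three = s[0:3]     # 'App'
same_thing = s[:3]       # 'App'  (start defaults to post 0)
the_rest = s[3:]         # 'le'   (end defaults to the last post)
middle = s[1:4]          # 'ppl'

# A single index returns the one character just after that post
first_char = s[0]        # 'A'

print(first_three, same_thing, the_rest, middle, first_char)
```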
Negative indices are also handled differently.
R: Negative indices start from the end of the string inclusively:
str_sub(s, -1)
## [1] "e"
str_sub(s, -3)
## [1] "ple"
Python: Negative indices start from the end of the string, but only return the character at that index:
s[-1]
## 'e'
s[-3]
## 'p'
To get all of the characters from that index onward, you have to use a slice. Note that the character at the ending index is not included:
s[-3:-1]
## 'pl'
s[-3:5]
## 'ple'
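You can also leave the ending index out entirely, which is the closest match to str_sub(s, -3) in R. A short sketch, with illustrative variable names:

```python
s = "Apple"

last_three = s[-3:]         # 'ple'   (from the third-to-last character to the end)
all_but_last = s[:-1]       # 'Appl'  (everything except the last character)
explicit_end = s[-3:len(s)] # 'ple'   (same as s[-3:5] for this string)

print(last_three, all_but_last, explicit_end)
```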
You can get the index of a character or sub-string in Python using the .index() method:
R: Returns the starting and ending indices of the sub-string
str_locate(s, "pp")
## start end
## [1,] 2 3
Python: Returns only the starting index of the sub-string
s.index("pp")
## 1
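Two follow-up points, sketched below with illustrative variable names. First, .index() raises an error if the sub-string is not found, whereas the related .find() method returns -1 instead. Second, you can recover the start and end positions that R reports by adjusting for the 0-based index.

```python
s = "Apple"
sub = "pp"

start = s.index(sub)          # 1

# .find() returns -1 when the sub-string is missing; .index() raises ValueError
missing = s.find("zz")        # -1
try:
    s.index("zz")
    found = True
except ValueError:
    found = False

# Convert to the 1-based start and end that str_locate() reports in R
r_start = start + 1           # 2
r_end = start + len(sub)      # 3

print(start, missing, found, r_start, r_end)
```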
Like in R, splitting a string returns a list of strings. Python lists are similar to R lists, but they only have single brackets. Here’s an example:
R:
s <- "Apple"
str_split(s, "pp")
## [[1]]
## [1] "A" "le"
Python:
s = "Apple"
s.split("pp")
## ['A', 'le']
In both languages, the returned list contains the remaining characters after splitting the string (in this case, "A" and "le"). One main difference though is that R returns a list of vectors, so to access the returned vector containing "A" and "le" you have to access the first element in the list, like this:
str_split(s, "pp")[[1]]
## [1] "A" "le"
This is because in R the str_split() function is vectorized, meaning that the function can also be performed on a vector of strings, like this:
s <- c("Apple", "Snapple")
str_split(s, "pp")
## [[1]]
## [1] "A" "le"
##
## [[2]]
## [1] "Sna" "le"
In this example, it’s easier to see that R is returning a list of vectors. In contrast, the Python .split() method belongs to individual strings, so it cannot be called on a list of strings:
s = ["Apple", "Snapple"]
s.split("pp")
## AttributeError: 'list' object has no attribute 'split'
One way to handle this is to import the numpy package, which has an “array” structure similar to R vectors (we’ll cover this in more detail in week 13). Here’s an example:
import numpy as np
s = np.array(["Apple", "Snapple"])
np.char.split(s, "pp")
## array([list(['A', 'le']), list(['Sna', 'le'])], dtype=object)
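If you would rather not bring in numpy just for this, another option is to apply .split() to each string in turn. The sketch below uses a list comprehension, which is a compact way of writing a for loop that builds a list (it is not covered in these notes, and the variable names are only for illustration). The result is a list of lists, similar in shape to what str_split() returns in R.

```python
words = ["Apple", "Snapple"]

# Call .split() on each string and collect the results in a new list
pieces = [word.split("pp") for word in words]
print(pieces)
# [['A', 'le'], ['Sna', 'le']]

# The first element plays the role of str_split(s, "pp")[[1]] in R
first = pieces[0]
print(first)
# ['A', 'le']
```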
While R scripts end in .R, Python scripts end in .py. You can open up and save a blank Python script in RStudio by clicking
File -> New File -> Python Script
Save it as foo.py in your project folder. Now that it’s saved, let’s add some code to run. As a quick example, I’m going to add code defining the function isOdd() and then create a few values testing it:
def isOdd(n):
    if n % 2 == 1:
        return True
    return False

n1 = isOdd(4)
n2 = isOdd(3)
Now that you have this code stored in your foo.py file, you can source the file from inside R, like this:
reticulate::source_python('foo.py')
Magically, the function isOdd() and the objects we created (n1 and n2) are now accessible from R!
isOdd(7)
## [1] TRUE
n1
## [1] FALSE
n2
## [1] TRUE
You can get really creative with object-oriented programming in Python by creating your own custom classes, allowing you to embed values and methods that belong only to objects of that class. For example, here’s how to create a class called Animal, which is defined by two values: species and sound. Note the white space indentations - without them Python will raise an error:
class Animal:
    def __init__(self, species, sound):
        self.species = species
        self.sound = sound
By convention, the first function in a custom class is the __init__ function. This is where to define any arguments that must be input when creating an object of the custom class. The use of self here refers to the object being created, so assigning to self.species and self.sound stores those values in the object once it’s created.
Here’s an example of how we could use the Animal class:
riley = Animal("Dog", "Woof")
Here I’ve defined an object named riley (my dog’s name), and it has two stored values: "Dog" (the species) and "Woof" (the sound). I can access these stored values by calling the species and sound values from the riley object:
riley.species
## 'Dog'
riley.sound
## 'Woof'
I can also ask Python what type of object riley is, and it will tell me it’s of the Animal class:
type(riley)
## <class '__main__.Animal'>
In addition to just storing values, you can create custom methods that will only be accessible to objects of the custom class. Here I’m adding the method introduce() to the class Animal:
class Animal:
    def __init__(self, species, sound):
        self.species = species
        self.sound = sound

    def introduce(self):
        print("I'm a " + self.species + " and I say " + self.sound)
Now let’s re-define my riley object and try out our new method!
riley = Animal("Dog", "Woof")
riley.introduce()
## I'm a Dog and I say Woof
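To tie the class example back to loops, here is a sketch that creates two Animal objects and calls the same method on each. The class is repeated so the example runs on its own; the cat and the variable names are made up for illustration. The last lines check the claim that introduce() belongs only to Animal objects: an ordinary string does not have it.

```python
class Animal:
    def __init__(self, species, sound):
        self.species = species
        self.sound = sound

    def introduce(self):
        print("I'm a " + self.species + " and I say " + self.sound)

# Each object stores its own species and sound
pets = [Animal("Dog", "Woof"), Animal("Cat", "Meow")]
for pet in pets:
    pet.introduce()
# I'm a Dog and I say Woof
# I'm a Cat and I say Meow

# isinstance() and hasattr() are built-in functions for checking objects
is_animal = isinstance(pets[0], Animal)          # True
string_has_it = hasattr("foo", "introduce")      # False
```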