Learning Objectives
- Learn how to use the reticulate R package to work with Python in R.
- Learn basic Python syntax and how it compares to R.
- Learn how to write functions, conditional statements, and loops in Python.
- Learn how to use Python methods.
- Learn basic string manipulation in Python.
- Learn how to run a Python script from R.
Suggested readings
- The reticulate R package documentation.
Python is another very popular computing language for data analysis and general purpose computing. Since R is the main language for this course, we will not cover all the many wondrous things that Python can do. Instead, we will introduce Python through the lens of how it is used for data analysis, with a particular focus on comparing its similarities and differences with R.
While you can work with Python in a number of ways, we will use the reticulate package to access it directly from R!
To get started, install the package (remember, you only need to do this once on your computer):
install.packages('reticulate')
Once installed, load the package:
library(reticulate)
If you already have Python installed on your computer, you should be okay, but you may see the following message pop up in the console:
Would you like to install Miniconda? [Y/n]:
If so, I recommend you go ahead and install Miniconda by typing y and pressing enter. Miniconda is a minimal version of the larger “Anaconda” distribution that many people use to install Python, and it is the preferred setup for using Python in R.
Once you’ve loaded the reticulate library, use the following command to open up a Python REPL (which stands for “Read-Eval-Print Loop”):
repl_python()
Now look at your console - you should see three
>>>
symbols. This means you’re now using Python!
(Remember, the R console has only one >
symbol).
Check your Python version!
Above the >>> symbols, you should see a message indicating which version of Python you are using. It should say “Python 3….”. Python has two major versions (2 and 3), and we’ll be using Python 3. If you see Python 2, then you’ll need to adjust your configuration to use Python 3. If you installed Miniconda, this should be Python 3.
If you want to get back to good ol’ R, just type the command exit into the Python console:
exit
Note that you should type exit and not exit() with parentheses.
Python has the same arithmetic (+ - * /) and relational (e.g. <, >, ==) operators as R, and equivalent logical operators, but some of the symbols are a little different. Here’s a quick comparison of these differences:
Arithmetic operators | R | Python |
---|---|---|
Integer division | %/% | // |
Modulus | %% | % |
Powers | ^ | ** |
Logical operators | R | Python |
---|---|---|
And | & | & or and |
Or | \| | \| or or |
Not | ! | not |
Python uses the same symbols & and | for combining logical values, but it has no ! operator (only != for “not equal”); to negate a statement you use the English word “not”. Python also supports the English words “and” and “or”, which are the more common choice. For example, the following statements will both return True:
(3 == 3) & (4 == 4)
## True
(3 == 3) and (4 == 4)
## True
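The operator differences in the two tables above can be checked directly in the Python REPL. The sketch below is only illustrative (the variable names are made up for this example). It also shows why the parentheses matter when using &: in Python, & is evaluated before ==.

```python
# Arithmetic operators that differ from R
int_div = 7 // 2    # integer division (R: 7 %/% 2)
remainder = 7 % 2   # modulus (R: 7 %% 2)
power = 2 ** 3      # powers (R: 2^3)
print(int_div, remainder, power)
## 3 1 8

# Logical operators written with the English words
both = (3 == 3) and (4 == 4)
either = (3 == 4) or (4 == 4)
negated = not (3 == 4)
print(both, either, negated)
## True True True

# Without parentheses, & is evaluated before ==, so this is really
# 3 == (3 & 4) == 4, which is False
no_parens = 3 == 3 & 4 == 4
print(no_parens)
## False
```

Because of this, the words “and”, “or”, and “not” are usually the safer choice when combining comparisons.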
While in R you can use either =
or <-
to
assign values to objects, in Python only the =
symbol can
be used:
value = 3
value
## 3
For the most part, Python has the same data types as R: “numeric”, “string”, and “logical”. But they use different words to describe them:
Description | R | Python |
---|---|---|
numeric (w/decimal) | "double" | "float" |
integer | "integer" | "int" |
character | "character" | "str" |
logical | "logical" | "bool" |
There are three important distinctions between the languages on data types:
- In R we use TRUE and FALSE (in all caps) to denote logical statements that are “True” or “False”, but in Python you only capitalize the first letter: True or False.
- In R, numbers (e.g. 3) are technically floats. In Python, numbers are integers by default unless they have decimal values (e.g. 3 is an int type, but 3.14 is a float type).
- In R, the null (empty) object is NULL, but in Python we use None.
You can check the type using typeof() in R or type() in Python:
R:
typeof(3.14)
#> [1] "double"
typeof(3L)
#> [1] "integer"
typeof("3")
#> [1] "character"
typeof(TRUE)
#> [1] "logical"
Python:
type(3.14)
## <class 'float'>
type(3)
## <class 'int'>
type("3")
## <class 'str'>
type(True)
## <class 'bool'>
In R, you can convert data types using the general form of as.something(), replacing “something” with a data type. In Python, you can simply use the data type name to convert types. Here’s a comparison:
Convert to double / float:
R:
as.double(3)
#> [1] 3
Python:
float(3)
## 3.0
Convert to integer:
R:
as.integer(3.14)
#> [1] 3
Python:
int(3.14)
## 3
Convert to string:
R:
as.character(3.14)
#> [1] "3.14"
Python:
str(3.14)
## '3.14'
Convert to logical:
R:
as.logical(3.14)
#> [1] TRUE
Python:
bool(3.14)
## True
Remember that when converting numbers to “logical” types, any number other than 0 converts to TRUE, and 0 converts to FALSE.
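A few more conversions are worth trying in the Python REPL. This is a small sketch with made-up variable names; it shows that int() truncates rather than rounds, that strings holding numbers can be converted, and how bool() treats zero and empty strings.

```python
# int() truncates toward zero rather than rounding
truncated = (int(3.99), int(-3.99))
print(truncated)
## (3, -3)

# Strings that look like numbers can be converted too
parsed = (int("3"), float("3.14"))
print(parsed)
## (3, 3.14)

# bool() is False for 0 and True for any other number
from_numbers = [bool(0), bool(0.0), bool(-1), bool(3.14)]
print(from_numbers)
## [False, False, True, True]

# A string is False only when it is empty, even if it contains "0"
from_strings = [bool(""), bool("0")]
print(from_strings)
## [False, True]
```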
Perhaps the biggest syntax difference between R and Python is that Python uses white space to define blocks of code.
For example, to write a loop in Python, you have to indent the body of the loop (by convention, four spaces), otherwise you’ll get an error. The benefit of this is that it forces you to use good style practices, and you don’t have to use the {} symbols like you do in R. The downside is that if you have a single space character missing, you’ll get an error, and sometimes that’s hard to notice.
Here’s a comparison of loops in R and Python:
for loop:
R:
for (i in c(1,3,5)) {
    print(i)
}
#> [1] 1
#> [1] 3
#> [1] 5
Python:
for i in [1,3,5]:
    print(i)
## 1
## 3
## 5
while loop:
R:
i <- 1
while (i <= 5) {
    print(i)
    i <- i + 2
}
#> [1] 1
#> [1] 3
#> [1] 5
Python:
i = 1
while i <= 5:
    print(i)
    i = i + 2
## 1
## 3
## 5
One of the things many people love about Python is just how “clean” the syntax looks. Compared to R, the Python code above is more compact and contains fewer distracting elements, like the {} symbols. You also don’t need to include () symbols on the first line.
Other than these differences in syntax, loops are essentially the same across the two languages.
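To see what the white space rule means in practice, here is a minimal sketch (the variable names are made up). The first loop is indented correctly. The second is the same kind of loop with the indentation removed; it is passed as text to Python’s built-in compile() function, which only parses the code, so the error can be caught and inspected.

```python
# Loop body indented by four spaces: runs fine
total = 0
for i in [1, 3, 5]:
    total = total + i
print(total)
## 9

# The same kind of loop with the indentation removed
bad_code = "for i in [1, 3, 5]:\nprint(i)"
try:
    compile(bad_code, "<example>", "exec")
    error_name = None
except IndentationError as err:
    error_name = type(err).__name__
print(error_name)
## IndentationError
```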
Functions use the same “spacing” format as loops, and again the
Python syntax is more compact. Here’s a comparison of the
isEven(n)
function:
R:
isEven <- function(n) {
    if (n %% 2 == 0) {
        return(TRUE)
    }
    return(FALSE)
}
Python:
def isEven(n):
    if (n % 2 == 0):
        return(True)
    return(False)
Note the difference in the ordering of the first lines. In R, you first define the function name, then you assign to that name the function and argument(s).
In Python, you do not use any assignment to create a function.
Rather, you use the command def
followed by the function
name and argument(s). Here, the Python syntax is quite natural - you use
the same syntax that you would use when calling the function
(e.g. isEven(n)
).
Note also that the if statement in Python uses the same general syntax of indented white space instead of the {} symbols.
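Since the comparison n % 2 == 0 already evaluates to True or False, the function can be made shorter still. The version below is only a sketch; is_even is a hypothetical alternative name used here so it does not clash with isEven() above.

```python
def is_even(n):
    # n % 2 == 0 is already a bool, so return it directly
    return n % 2 == 0

print(is_even(4), is_even(7))
## True False

# The function can be used inside a loop-like expression
evens = [n for n in range(1, 11) if is_even(n)]
print(evens)
## [2, 4, 6, 8, 10]
```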
You might have heard people (i.e. me) say that Python is more “object-oriented” whereas R is more “functional.” What I mean is that in R you mostly apply functions to objects, but in Python you often call special functions that belong to certain object types. Here’s an example of converting a string to upper case:
R: We use the string "foo"
as an
argument to the str_to_upper()
function from the
stringr library, which returns "FOO"
.
stringr::str_to_upper("foo")
#> [1] "FOO"
Python: we use the .upper()
method that belongs to the string "foo"
, which
returns "FOO"
. All strings in Python have this method.
"foo".upper()
## 'FOO'
Methods are special functions that belong to objects of a certain
class. You “call” methods using the name of the object followed
by the .
symbol, like this:
object.method()
You can also see the different methods available for a particular
object by calling the dir
function on the object:
s = "foo"
dir(s)
## ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
Wow, strings have a lot of methods!
The concept of using methods is a major part of the “object-oriented” way of programming, since it’s the object that is the center of attention. The object in Python is more than just a stored value - it’s a source of other methods (depending on the object’s class).
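The long dir() output above is easier to read if the names that start with underscores are filtered out. The sketch below (variable names are made up) does that, and also shows that method calls can be chained, because each string method returns a new string and leaves the original unchanged.

```python
s = "foo"

# Keep only the method names that do not start with an underscore
public_methods = [name for name in dir(s) if not name.startswith("_")]
print(public_methods[:3])
## ['capitalize', 'casefold', 'center']

# Each method returns a new string, so calls can be chained
shout = s.upper().center(9, "*")
print(shout)
## ***FOO***

# The original string is unchanged
print(s)
## foo
```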
Now that you’ve seen a little about how Python methods work, we’ll get to use some while working with strings!
String manipulation is one area where more substantial differences emerge between Python and R. Because R’s built-in functions for dealing with strings are rather unintuitive, we’ve relied on the stringr package.
In Python, many of the basic string manipulations are actually done with basic arithmetic operators, just like with numbers. Here are a few comparisons:
R
Python
String concatenation:
In R, we use the function paste()
to combine
strings:
paste("foo", "bar", sep = "")
#> [1] "foobar"
In Python, you can combine strings by “adding” them together. The default is to merge them with no space in between:
"foo" + "bar"
## 'foobar'
String repetition:
Creating a repeated string with paste() takes more steps in R (stringr’s str_dup() is a shortcut). You first have to create a vector of repeated strings, and then “collapse” them using the paste() function:
paste(rep("foo", 3), collapse = '')
#> [1] "foofoofoo"
In Python, you can just “multiply” the string, like this:
"foo" * 3
## 'foofoofoo'
Sub-string detection:
In R, we use the str_detect()
function:
str_detect('Apple', 'ppl')
#> [1] TRUE
In Python, you can detect sub-strings using the in
operator:
'ppl' in 'Apple'
## True
Because Python has both functions and object methods, it can sometimes be tricky to remember which to use for a specific purpose. For example, if you want to know how many characters are in a string, you use a function, just like in R:
String length:
R:
str_length('foo')
#> [1] 3
Python:
len('foo')
## 3
However, lots of basic string manipulations are done with string methods:
Case conversion:
R:
s <- "A longer string"
str_to_upper(s)
#> [1] "A LONGER STRING"
str_to_lower(s)
#> [1] "a longer string"
str_to_title(s)
#> [1] "A Longer String"
Python:
s = "A longer string"
s.upper()
## 'A LONGER STRING'
s.lower()
## 'a longer string'
s.title()
## 'A Longer String'
Remove excess white space:
R:
s <- " A string with space "
str_trim(s)
#> [1] "A string with space"
Python:
s = " A string with space "
s.strip()
## 'A string with space'
Detect if string contains only numbers:
R doesn’t have a function for this, but you can convert it to a
number and check if the result is not NA
:
s <- "42"
!is.na(as.numeric(s))
#> [1] TRUE
Python has some handy string methods!
s = "42"
s.isnumeric()
## True
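One caveat: .isnumeric() checks that every character in the string is a numeric character, so a decimal point or a minus sign makes it return False. A sketch that behaves more like the R approach above is to try the conversion and see whether it fails. The helper name looks_numeric is hypothetical, not a built-in.

```python
def looks_numeric(s):
    """Hypothetical helper: True if float() can parse the string."""
    try:
        float(s)
        return True
    except ValueError:
        return False

for text in ["42", "4.2", "-42", "forty-two"]:
    print(text, text.isnumeric(), looks_numeric(text))
## 42 True True
## 4.2 False True
## -42 False True
## forty-two False False
```

Note that float() also accepts a few special strings such as "nan" and "inf", so this check is looser than it may first appear.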
To extract a sub-string, in R we have to use the
str_sub()
function. But in Python, you can simply use the
[]
symbols. In either case, you have to provide indices of
where to start and stop the “slice”.
For example, here’s how to extract the sub-string "App"
from "Apple"
in each language:
R:
s <- "Apple"
str_sub(s, 1, 3)
#> [1] "App"
Python:
s = "Apple"
s[0:3]
## 'App'
Note that we had to use a different starting index here to get the same sub-string in each language. That’s because indexing starts at 0 in Python.
If this seems strange, just imagine “fence posts”. In Python, the elements in a sequence are like items sitting between fence posts. So the indices for the string "Apple" look like this:
index:  0     1     2     3     4     5
        |     |     |     |     |     |
        | "A" | "p" | "p" | "l" | "e" |
        |     |     |     |     |     |
When you make a slice in Python, you slice at the fence post
number to get the elements between the posts. So in this case,
if we want to get the sub-string "App"
from
"Apple"
, we need to slice from the post 0
to
3
.
Negative indices are also handled differently.
R: Negative indices start from the end of the string inclusively:
str_sub(s, -1)
#> [1] "e"
str_sub(s, -3)
#> [1] "ple"
Python: Negative indices start from the end of the string, but only return the character at that index:
s[-1]
## 'e'
s[-3]
## 'p'
To get an inclusive string, you have to provide a starting and ending index:
s[-3:-1]
## 'pl'
s[-3:5]
## 'ple'
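A slice can also leave out one of its two indices, which avoids having to know the length of the string. A short sketch (variable names are made up):

```python
s = "Apple"

# Leaving out the end index slices through to the end of the string
last_three = s[-3:]
print(last_three)
## ple

# Leaving out the start index slices from the beginning
first_three = s[:3]
print(first_three)
## App
```

Here s[-3:] gives the same result as s[-3:5], and is the closest match to str_sub(s, -3) in R.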
You can get the index of a character or sub-string in Python using
the .index()
method:
R: Returns the starting and ending indices of the sub-string
str_locate(s, "pp")
#> start end
#> [1,] 2 3
Python: Returns only the starting index of the sub-string
s.index("pp")
## 1
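If you want both a start and an end position, as str_locate() gives in R, you can compute the end from the length of the sub-string. The sketch below (variable names are made up) also shows the related .find() method, which returns -1 when the sub-string is missing, whereas .index() raises an error.

```python
s = "Apple"

# Start and end "fence posts" of the sub-string
start = s.index("pp")
end = start + len("pp")
print(start, end, s[start:end])
## 1 3 pp

# .find() returns -1 when nothing is found
missing = s.find("zz")
print(missing)
## -1

# .index() raises a ValueError instead
try:
    s.index("zz")
    error_name = None
except ValueError as err:
    error_name = type(err).__name__
print(error_name)
## ValueError
```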
Like in R, splitting a string returns a list of strings. Python lists are similar to R lists, but they are written and indexed with single brackets. Here’s an example:
R:
s <- "Apple"
str_split(s, "pp")
#> [[1]]
#> [1] "A" "le"
Python:
s = "Apple"
s.split("pp")
## ['A', 'le']
In both languages, the returned list contains the remaining
characters after splitting the string (in this case, "A"
and "le"
). One main difference though is that R returns a
list of vectors, so to access the returned vector containing
"A"
and "le"
you have to access the first
element in the list, like this:
str_split(s, "pp")[[1]]
#> [1] "A" "le"
This is because in R the str_split()
function is
vectorized, meaning that the function can also be performed on
a vector of strings, like this:
s <- c("Apple", "Snapple")
str_split(s, "pp")
#> [[1]]
#> [1] "A" "le"
#>
#> [[2]]
#> [1] "Sna" "le"
In this example, it’s easier to see that R is returning a list of vectors. In contrast, Python’s .split() method only works on a single string, not on a list of strings:
s = ["Apple", "Snapple"]
s.split("pp")
## AttributeError: 'list' object has no attribute 'split'
One way to handle this is to import the numpy package, which has an “array” structure similar to R vectors (we’ll cover this in more detail in week 13). Here’s an example:
import numpy as np
s = np.array(["Apple", "Snapple"])
np.char.split(s, "pp")
## array([list(['A', 'le']), list(['Sna', 'le'])], dtype=object)
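If you would rather not use numpy, plain Python can do the same thing by applying .split() to each string in turn. A minimal sketch using a list comprehension (the name pieces is made up):

```python
s = ["Apple", "Snapple"]

# Apply .split() to each string in the list, one at a time
pieces = [word.split("pp") for word in s]
print(pieces)
## [['A', 'le'], ['Sna', 'le']]

# The result is a list of lists, much like R's list of vectors
print(pieces[0])
## ['A', 'le']
```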
While R scripts end in .R
, Python scripts end in
.py
. You can open up and save a blank Python script in
RStudio by clicking
File -> New File -> Python Script
Save it as foo.py
in your project folder. Now that it’s
saved, let’s add some code to run. As a quick example, I’m going to add
code defining the function isOdd()
and then create a few
values testing it:
def isOdd(n):
    if (n % 2 == 1):
        return(True)
    return(False)
n1 = isOdd(4)
n2 = isOdd(3)
Now that you have this code stored in your foo.py
file,
you can source the file from inside R, like this:
reticulate::source_python('foo.py')
Magically, the function isOdd()
and the objects we
created (n1
and n2
) are now accessible from
R!
isOdd(7)
## [1] TRUE
n1
## [1] FALSE
n2
## [1] TRUE
You can get really creative with object-oriented programming in Python by creating your own custom classes, allowing you to embed values and methods that belong only to objects of that class. For example, here’s how to create a class called Animal, which is defined by two values: species and sound. Note the white space indentations: without them Python will raise an error:
class Animal:
    def __init__(self, species, sound):
        self.species = species
        self.sound = sound
The first function in most custom classes is the __init__ function. This is where you define any arguments that must be input when creating an object of the custom class. The use of self here defines which methods and values will be stored in the object once it’s created.
Here’s an example of how we could use the Animal class:
riley = Animal("Dog", "Woof")
Here I’ve defined an object named riley
(my dog’s name),
and it has two stored values: "Dog"
(the species) and
"Woof"
(the sound). I can access these stored values by
calling the species
and sound
values from the
riley
object:
riley.species
## 'Dog'
riley.sound
## 'Woof'
I can also ask Python what type of object riley
is, and
it will tell me it’s of the Animal
class:
type(riley)
## <class '__main__.Animal'>
In addition to just storing values, you can create custom methods
that will only be accessible to objects of the custom class. Here I’m
adding the method introduce()
to the class
Animal
:
class Animal:
    def __init__(self, species, sound):
        self.species = species
        self.sound = sound
    def introduce(self):
        print("I'm a " + self.species + " and I say " + self.sound)
Now let’s re-define my riley
object and try out our new
method!
riley = Animal("Dog", "Woof")
riley.introduce()
## I'm a Dog and I say Woof
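Each object created from a class keeps its own stored values, so the same method gives different results for different objects. The sketch below repeats the Animal class and adds a hypothetical describe() method (not part of the original example) that returns its text rather than printing it. The second animal, tom, is also made up for illustration.

```python
class Animal:
    def __init__(self, species, sound):
        self.species = species
        self.sound = sound

    def introduce(self):
        print("I'm a " + self.species + " and I say " + self.sound)

    def describe(self):
        # Hypothetical extra method: returns the text instead of printing it
        return f"{self.species} says {self.sound}"

riley = Animal("Dog", "Woof")
tom = Animal("Cat", "Meow")  # made-up second example

print(riley.describe())
## Dog says Woof
print(tom.describe())
## Cat says Meow

# Both objects belong to the same class
print(isinstance(riley, Animal), isinstance(tom, Animal))
## True True
```

Returning a value instead of printing it makes the result easier to store or reuse later.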