Learning Objectives

  • Learn how to use the reticulate R package to work with python in R.
  • Learn basic python syntax and how it compares to R.
  • Learn how to write functions, conditional statements, loops in python.
  • Learn how to use Python methods.
  • Learn basic string manipulation in python.
  • Learn how to run a Python script fromr R.

Suggested readings

Getting started with Python (in R)

Python is another very popular computing language for data analysis and general purpose computing. Since R is the main language for this course, we will not cover all the many wonderous things that Python can do. Instead, we will introduce Python through the lens of how it is used for data analysis, with a particular focus on comparing its similarities and differences with R.

While you can work with Python in a number of ways, we will use the reticulate to access it directly from R!

Installation

To get started, install the package (remember, you only need to do this once on your computer):

install.package('reticulate')

Once installed, load the package:

library(reticulate)

If you already have Python installed on your computer, you should be okay, but you may see the following message pop up in the console:

Would you like to install Miniconda? [Y/n]:

If so, I recommend you go ahead and install Miniconda by typing y and pressing enter. Miniconda is a smaller version of the larger “Conda” distribution that most people use to install Python, and it is the preferred setup for using Python in R.

Starting Python

Once you’ve loaded the reticulate library, use the following command to open up a Python REPL (which stands for “Read–Eval–Print-Loop”):

repl_python()

Now look at your console - you should see three >>> symbols. This means you’re now using Python! (Remember, the R console has only one > symbol).

Check your Python version!

Above the >>> symbols, you should see a message indicating which version of Python you are using. It should say “Python 3….”. Python has two versions (2 and 3) - we’ll be using Python 3. If you see Python 2, then you’ll need to adjust your configuration to use Python 3. If you installed Miniconda, this should be Python 3.

Exiting Python

If you want to get back to good ’ol R, just type the command exit into the Python console:

exit

Note that you should exit and not exit() with parenthesis.

Python basics

Operators

Python has all the same arithmetic (+-*/), relational (<>=), and logical (&|!) operators as R, but some of the symbols are a little different. Here’s a quick comparison of these differences:

Arithmetic operators R Python
Integer division %/% //
Modulus %% %
Powers ^ **
Logical operators R Python
And & & or and
Or | | or or
Not ! ! or not

Python uses the same symbols &, |, and ! for assessing logical statements, but Python also supports the use of the English words and, or, and not. For example, the following statements will both return True

(3 == 3) & (4 == 4)
## True
(3 == 3) and (4 == 4)
## True

Variable assignment

While in R you can use either = or <- to assign values to objects, in Python only the = symbol can be used:

value = 3
value
## 3

Data types

For the most part, Python has the same data types as R: “numeric”, “string”, and “logical”. But they use different words to describe them:

Description R Python
numeric (w/decimal) "double" "float"
integer "integer" "int"
character "character" "str"
logical "logical" "bool"

There are three important distinctions between the languages on data types:

  1. Logicals: Logical statements in R use the words TRUE and FALSE (in all caps) to denote logical statements that are “True” or “False”, but in Python you only capitalize the first letter: True or False
  2. Integers vs. Floats: In R, all numbers are “floats” by default (i.e. they have decimal places), so even numbers that look like integers (e.g. 3) are technically floats. In Python, numbers are integers by default unless they have decimal values (e.g. 3 is an int type, but 3.14 is a float type).
  3. NULL: In R, a value of “nothing” is represented by NULL, but in Python we use None.

You can check the type using typeof() in R or type() in Python:

R:

typeof(3.14)
#> [1] "double"
typeof(3L)
#> [1] "integer"
typeof("3")
#> [1] "character"
typeof(TRUE)
#> [1] "logical"

Python:

type(3.14)
## <class 'float'>
type(3)
## <class 'int'>
type("3")
## <class 'str'>
type(True)
## <class 'bool'>

Coercing data types

In R, you can convert data types using the general form of as.something(), replacing “something” with a data type. In Python, you can simply use the data type name to convert types. Here’s a comparison:

R

Python

Convert to double / float:

as.double(3)
#> [1] 3
float(3)
## 3.0

Convert to integer:

as.integer(3.14)
#> [1] 3
int(3.14)
## 3

Convert to string:

as.character(3.14)
#> [1] "3.14"
str(3.14)
## '3.14'

Convert to logical:

as.logical(3.14)
#> [1] TRUE
bool(3.14)
## True

Remember that “logical” types convert to TRUE for any number other than 0, which converts to FALSE.

Loops

Perhaps the biggest syntax difference between R and Python is that Python uses white space to define things.

For example, to write a loop in Python, you have to indent the second line by four character spaces, otherwise you’ll get an error. The benefits of this is that it forces you to use good style practices, and you don’t have to use the {} symbols like you do in R. The downside is that if you have a single space character missing, you’ll get an error, and sometimes that’s hard to notice.

Here’s a comparison of loops in R and Python:

R

Python

for loop:

for (i in c(1,3,5)) {
    print(i)
}
#> [1] 1
#> [1] 3
#> [1] 5
for i in [1,3,5]:
    print(i)
## 1
## 3
## 5

while loop:

i <- 1
while (i <= 5) {
    print(i)
    i <- i + 2
}
#> [1] 1
#> [1] 3
#> [1] 5
i = 1
while i <= 5:
    print(i)
    i = i + 2
## 1
## 3
## 5

One of the things many people love about Python is just how “clean” the syntax looks. Compared to R, the Python code above is more compact and contains less distracting elements, like the “{}” symbols. You also don’t need to include () symbols on the first line.

Other than these differences in syntax, loops are essentially the same across the two languages.

Functions

Functions use the same “spacing” format as loops, and again the Python syntax is more compact. Here’s a comparison of the isEven(n) function:

R:

isEven <- function(n) {
    if (n %% 2 == 0) {
        return(TRUE)
    }
    return(FALSE)
}

Python:

def isEven(n):
    if (n % 2 == 0):
        return(True)
    return(False)

Note the difference in the ordering of the first lines. In R, you first define the function name, then you assign to that name the function and argument(s).

In Python, you do not use any assignment to create a function. Rather, you use the command def followed by the function name and argument(s). Here, the Python syntax is quite natural - you use the same syntax that you would use when calling the function (e.g. isEven(n)).

Note also that the if statement in Python also uses the same general syntax of indented white space instead of using the {} symbols.

Python Methods

You might have heard people (i.e. me) say that Python is more “object-oriented” whereas R is more “functional.” What I mean is that in R you mostly apply functions to objects, but in Python you often call special functions that belong to certain object types. Here’s an example of converting a string to upper case:

R: We use the string "foo" as an argument to the str_to_upper() function from the stringr library, which returns "FOO".

stringr::str_to_upper("foo")
#> [1] "FOO"

Python: we use the .upper() method that belongs to the string "foo", which returns "FOO". All strings in Python have this method.

"foo".upper()
## 'FOO'

Methods are special functions that belong to objects of a certain class. You “call” methods using the name of the object followed by the . symbol, like this:

object.method()

You can also see the different methods available for a particular object by calling the dir function on the object:

s = "foo"
dir(s)
## ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

Wow, strings have a lot of methods!

The concept of using methods is a major part of the “object-oriented” way of programming, since it’s the object that is the center of attention. The object in Python is more than just a stored value - it’s a source of other methods (depending on the object’s class).

Now that you’ve seen a little about how Python methods work, we’ll get to use some working with strings!

Strings

String manipulation is one area where more substantial differences emerge between Python and R. Because R’s built in functions for dealing with strings are rather unintuitive, we’ve relied on the stringr package.

In Python, many of the basic string manipulations are actually done with basic arithmatic operators, just like with numbers. Here are a few comparisons:

R

Python

String concatenation:

In R, we use the function paste() to combine strings:

paste("foo", "bar", sep = "")
#> [1] "foobar"

In Python, you can combine strings by “adding” them together. The default is to merge them with no space in between:

"foo" + "bar"
## 'foobar'

String repetition:

Creating a repeated string is even more complicated in R. You first have to create a vector of repeated strings, and then “collapse” them using the paste() function:

paste(rep("foo", 3), collapse = '')
#> [1] "foofoofoo"

In Python, you can just “multiply” the string, like this:

"foo" * 3
## 'foofoofoo'

Sub-string detection:

In R, we use the str_detect() function:

str_detect('Apple', 'ppl')
#> [1] TRUE

In Python, you can detect sub-strings using the in operator:

'ppl' in 'Apple'
## True

Functions and methods

Because Python has both functions and object methods, it can sometimes be tricky to remember which to use for a specific purpose. For example, if you want to know how many characters are in a string, you use a function, just like in R:

R

Python

String length:

str_length('foo')
#> [1] 3
len('foo')
## 3

However, lots of basic string manipulations are done with string methods:

R

Python

Case converstion:

s <- "A longer string"
str_to_upper(s)
#> [1] "A LONGER STRING"
str_to_lower(s)
#> [1] "a longer string"
str_to_title(s)
#> [1] "A Longer String"
s = "A longer string"
s.upper()
## 'A LONGER STRING'
s.lower()
## 'a longer string'
s.title()
## 'A Longer String'

Remove excess white space:

s <- "     A string with space     "
str_trim(s)
#> [1] "A string with space"
s = "     A string with space     "
s.strip()
## 'A string with space'

Detect if string contains only numbers:

R doesn’t have a function for this, but you can convert it to a number and check if the result is not NA:

s <- "42"
!is.na(as.numeric(s))
#> [1] TRUE

Python has some handy string methods!

s = "42"
s.isnumeric()
## True

Slicing

To extract a sub-string, in R we have to use the str_sub() function. But in Python, you can simply use the [] symbols. In either case, you have to provide indices of where to start and stop the “slice”.

For example, here’s how to extract the sub-string "App" from "Apple" in each language:

R:

s <- "Apple"
str_sub(s, 1, 3)
#> [1] "App"

Python:

s = "Apple"
s[0:3]
## 'App'

Note that we had to use a different starting index here to get the same sub-string in each language. That’s because indexing starts at 0 in Python.

If this seems strange, just imagine “fence posts”. In Python, the elements in a sequence are like items sitting between fence posts. So the index of each character in the string "Apple" look like this:

index: 0     1     2     3     4     5
       |     |     |     |     |     |
       | "A" | "p" | "p" | "l" | "e" |
       |     |     |     |     |     |

When you make a slice in Python, you slice at the fence post number to get the elements between the posts. So in this case, if we want to get the sub-string "App" from "Apple", we need to slice from the post 0 to 3.

Negative indices are also handled differently.

R: Negative indices start from the end of the string inclusively:

str_sub(s, -1)
#> [1] "e"
str_sub(s, -3)
#> [1] "ple"

Python: Negative indices start from the end of the string, but only return the character at that index:

s[-1]
## 'e'
s[-3]
## 'p'

To get an inclusive string, you have to provide a starting and ending index:

s[-3:-1]
## 'pl'
s[-3:5]
## 'ple'

You can get the index of a character or sub-string in Python using the .index() method:

R: Returns the starting and ending indices of the sub-string

str_locate(s, "pp")
#>      start end
#> [1,]     2   3

Python: Returns only the starting index of the sub-string

s.index("pp")
## 1

Splitting strings

Like in R, splitting a string returns a list of strings. Python lists are similar to R lists, but they only have single brackets. Here’s an example:

R:

s <- "Apple"
str_split(s, "pp")
#> [[1]]
#> [1] "A"  "le"

Python:

s = "Apple"
s.split("pp")
## ['A', 'le']

In both languages, the returned list contains the remaining characters after splitting the string (in this case, "A" and "le"). One main difference though is that R returns a list of vectors, so to access the returned vector containing "A" and "le" you have to access the first element in the list, like this:

str_split(s, "pp")[[1]]
#> [1] "A"  "le"

This is because in R the str_split() function is vectorized, meaning that the function can also be performed on a vector of strings, like this:

s <- c("Apple", "Snapple")
str_split(s, "pp")
#> [[1]]
#> [1] "A"  "le"
#> 
#> [[2]]
#> [1] "Sna" "le"

In this example, it’s easier to see that R is returning a list of vectors. In contrast, Python cannot perform a split on multiple strings:

s = ["Apple", "Snapple"]
s.split("pp")
## AttributeError: 'list' object has no attribute 'split'

To handle this, you will need to import the numpy package, which has an “array” structure similar to R vectors (we’ll cover this in more detail on week 13). Here’s an example:

import numpy as np

s = np.array(["Apple", "Snapple"])
np.char.split(s, "pp")
## array([list(['A', 'le']), list(['Sna', 'le'])], dtype=object)

Running a Python script in R

While R scripts end in .R, Python scripts end in .py. You can open up and save a blank Python script in RStudio by clicking

File -> New File -> Python Script

Save it as foo.py in your project folder. Now that it’s saved, let’s add some code to run. As a quick example, I’m going to add code defining the function isOdd() and then create a few values testing it:

def isOdd(n):
    if (n % 2 == 1):
        return(True)
    return(False)

n1 = isOdd(4)
n2 = isOdd(3)

Now that you have this code stored in your foo.py file, you can source the file from inside R, like this:

reticulate::source_python('foo.py')

Magically, the function isOdd() and the objects we created (n1 and n2) are now accessible from R!

isOdd(7)
## [1] TRUE
n1
## [1] FALSE
n2
## [1] TRUE

Summary of R/Python differences

  • Indexing starts at 0 in Python and 1 in R.
  • Strings in Python can be manipulated with arithmetic operators.
  • Python is more “object-oriented” whereas R is more “functional”.

Tips

Making your own Python class

You can get really creative with object-oriented programming in Python by creating your own custom classes, allowing you to embed values and methods that belong only to objects of that class. For example, here’s how to create a class called Animal, which is defined by two values: species and sound. Note the white space indentations - without them Python will error:

class Animal:
    def __init__(self, species, sound):
        self.species = species
        self.sound = sound

The first function in any custom class is the __init__ function. This is where to define any arguments that must be input when defining an object of the custom class. The use of self here defines which methods and values will be stored in the object onces it’s created.

Here’s a example of how we could use the Animal class:

riley = Animal("Dog", "Woof")

Here I’ve defined an object named riley (my dog’s name), and it has two stored values: "Dog" (the species) and "Woof" (the sound). I can access these stored values by calling the species and sound values from the riley object:

riley.species
## 'Dog'
riley.sound
## 'Woof'

I can also ask Python what type of object riley is, and it will tell me it’s of the Animal class:

type(riley)
<class '__main__.Animal'>

In addition to just storing values, you can create custom methods that will only be accessible to objects of the custom class. Here I’m adding the method introduce() to the class Animal:

class Animal:
    def __init__(self, species, sound):
        self.species = species
        self.sound = sound

    def introduce(self):
        print("I'm a " + self.species + " and I say " + self.sound)

Now let’s re-define my riley object and try out our new method!

riley = Animal("Dog", "Woof")
riley.introduce()
## I'm a Dog and I say Woof

EMSE 4571: Intro to Programming for Analytics (Spring 2022)
Thursdays | 12:45 - 3:15 PM EST | Tompkins 208 | Dr. John Paul Helveston | jph@gwu.edu
LICENSE: CC-BY-SA