
I am trying to log in to a website that requires a username and password, using rvest.

I am using this as a resource as I found it very helpful: https://awesomeopensource.com/project/yusuzech/r-web-scraping-cheat-sheet#rvest7.5

When I submit the login form I receive an HTTP 404 warning message and cannot proceed with reading any of the HTML on the webpage.

Submitting with 'NULL'
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode,  :
  Not Found (HTTP 404).

Can anyone who understands HTML please help me understand if I am passing the right fields in my submit form?

My code looks as follows:

install.packages("pacman")

# LOAD LIBRARIES
pacman::p_load(rvest,purrr,xml2,dplyr,stringr)

# TARGET URL
url <- "https://www.mywebsite.com/"

# SPOOF THE USER AGENT TO LOOK LIKE A BROWSER
ua <- httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36")

# CREATE A PERSISTENT SESSION
my_session <- rvest::html_session(url,ua)

# FIND ALL FORMS IN THE WEB PAGE
unfilled_forms <- rvest::html_form(my_session)

# SELECT THE FORM THAT YOU NEED TO FILL IN
login_form <- unfilled_forms[[1]]

# FILL IN THE FORM
filled_form <- set_values(login_form, username = "myUsername", password = "myPassword")

# SUBMIT THE FORM TO LOGIN
login_session <- submit_form(my_session, filled_form)
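
One way to check whether the right form and fields are being passed is to list every form's method, action, and field names before filling anything in. The sketch below is only an illustration: the markup, the `search` form, and the `/account/login` action are hypothetical stand-ins parsed from a string so that it runs offline, and `$action` assumes rvest 1.0 or later (older versions may store the target under a different name, such as `$url`).

```r
library(rvest)

# Hypothetical page with two forms; a real site will differ.
page <- read_html('
  <html><body>
    <form id="search" action="/search" method="get">
      <input type="text" name="q">
    </form>
    <form id="login" action="/account/login" method="post">
      <input type="text" name="username">
      <input type="password" name="password">
      <input type="submit" name="submit" value="Log in">
    </form>
  </body></html>')

forms <- html_form(page)

# Print where each form submits to and which field names it expects
for (i in seq_along(forms)) {
  cat("Form", i, ":", forms[[i]]$method, forms[[i]]$action, "\n")
  cat("  fields:", paste(names(forms[[i]]$fields), collapse = ", "), "\n")
}

# Pick the form whose fields include the ones you intend to set,
# instead of assuming it is the first form on the page.
has_login_fields <- vapply(
  forms,
  function(f) all(c("username", "password") %in% names(f$fields)),
  logical(1)
)
login_index <- which(has_login_fields)[1]
```

If the first form on the real page turns out to be something other than the login form, or its action points at a path that does not exist without JavaScript, that could explain a 404 on submit.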
TheGoat
  • `session` or `my_session`? – r2evans Feb 04 '21 at 04:31
  • Your code and process seem valid; sorry I can't go further (since I don't have access myself, and I'm not recommending you provide a user/pass to me :-). Perhaps not likely, but it is possible that they are filtering on the User Agent; you might try [setting it](https://httr.r-lib.org/reference/user_agent.html) to something more like your browser (since I'm assuming you can do all of this interactively in a web browser, so it's likely not your user or pass). – r2evans Feb 04 '21 at 04:39
  • @r2evans thanks for the pointer about `session` vs `my_session`; I have updated the code to fix this. I also followed your recommendation and changed the user agent to look like a browser, but unfortunately I am still getting the same 404 error. Back to the drawing board. – TheGoat Feb 04 '21 at 09:12

1 Answer


I decided to change direction and use RSelenium, which took a few hours to get the hang of, but I got there.

RSelenium is really useful when logins are required; I wish I had known about it months ago for another project I worked on.

library(RSelenium)
# https://stackoverflow.com/questions/55201226/session-not-created-this-version-of-chromedriver-only-supports-chrome-version-7/56173984
rd <- rsDriver(browser = "chrome",
               chromever = "88.0.4324.27",
               port = netstat::free_port())

remdr <- rd[["client"]]

url <- "https://www.mywebsite.com/"  # url of the site's login page

remdr$navigate(url)  # Navigating to the page

Sys.sleep(10)

loginbutton <- remdr$findElement(using = 'css selector','.plain')

loginbutton$clickElement()

username <- remdr$findElement(using = 'css selector','#username')

password <- remdr$findElement(using = 'css selector','#password')

login <- remdr$findElement(using = 'css selector','#btnLoginSubmit1')

username$sendKeysToElement(list("myUserName"))

password$sendKeysToElement(list("myPassword"))

login$clickElement()
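
The fixed `Sys.sleep(10)` above could be swapped for a short polling loop that returns as soon as the element exists. This is a rough sketch, not part of RSelenium: `wait_for` is a hypothetical helper name, and it relies on `findElement()` raising an error when nothing matches.

```r
# Hypothetical helper: call `check` repeatedly until it succeeds or time runs out.
# `check` is a zero-argument function that errors while the page is not ready.
wait_for <- function(check, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Turn errors (e.g. element not found yet) into NULL and keep polling
    result <- tryCatch(check(), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out after ", timeout, " seconds waiting for the page")
    }
    Sys.sleep(interval)
  }
}

# Usage with the session above (needs a running Selenium driver):
# loginbutton <- wait_for(function() {
#   remdr$findElement(using = "css selector", ".plain")
# })
```

The same helper can wrap the lookups for `#username` and `#password` if the login panel takes a moment to appear after the click.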
TheGoat