
I am trying to learn how to use the Ruby Mechanize gem. I was able to fill in the form and log in to the website, but I was not able to extract the data after logging in. The website displays the real data only when logged in; otherwise it shows default strings, e.g. 'View website' instead of www.example.com.

I tried writing this code:

#code to login
require 'mechanize'
require 'logger'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'zlib'

mechanize = Mechanize.new

login = mechanize.get('website login page')
form = login.forms.first
form['student_email'] = 'email@gmail.com'
form['student_password'] = 'password'
result = form.submit
puts result.code
puts "logged in"

#code to extract
url = 'data_path_url'
doc = Nokogiri::HTML(open(url))
paths = doc.css('.college_name a')  # capturing the links to extract
paths.each do |path|
  path = path['href']
  path = path.to_s
  page = Nokogiri::HTML(open(path))
  data = page.css('.font11.bold')   # data to extract
  puts data.text                    # data to display
end

I am still getting the default strings that are shown without logging in. I would be glad if someone could help me fix this code so that it stays in the session until the extraction completes.

Atchyut Nagabhairava
  • To keep the session you should capture cookies and send them with each request following the login one. Otherwise each subsequent request will look like a new one to the server. – yefrem Jul 04 '16 at 10:08
  • Read "[mcve]". Your code contains a lot that doesn't relate to the question. Please reduce it to the minimum necessary or show the code that ties the second section to the first, otherwise you waste the time of those helping answer by making them sift through it to determine what's used or not. – the Tin Man Jul 04 '16 at 21:08
  • Which website are you trying to scrape that requires a login? – Santosh Sharma Sep 01 '16 at 07:23

1 Answer


When you open the URL with open-uri and parse it with Nokogiri, the server sees a brand-new, unauthenticated request, so it serves the default page again. To make that approach work you would have to capture the cookies set at login and send them with every subsequent request.
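
For reference, a minimal sketch of that cookie-forwarding approach could look like this. It assumes the mechanize agent from the question has already logged in, keeps the question's 'data_path_url' placeholder (replace it with the real, absolute URL), and uses the cookie helpers from the http-cookie gem that Mechanize ships with:

require 'mechanize'
require 'open-uri'
require 'nokogiri'

mechanize = Mechanize.new
# ... log in with mechanize exactly as in the question ...

# Build a Cookie header from the cookies Mechanize stored at login
# and pass it to open-uri so the server recognises the session.
uri = URI('data_path_url')   # placeholder from the question
cookie_header = HTTP::Cookie.cookie_value(mechanize.cookie_jar.cookies(uri))
doc = Nokogiri::HTML(open(uri, 'Cookie' => cookie_header))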

However, an easier way to achieve the result is to use Mechanize for the scraping as well, since Mechanize is built on top of Nokogiri and the familiar Nokogiri methods (such as css and search) are available on the pages it returns.

This is a modification of your code to scrape using Mechanize. It starts from a fresh agent:

agent = Mechanize.new

In your case you can use the mechanize instance you already logged in with in place of agent.

#code to extract data

doc = agent.get('data_path_url')
paths = doc.css('.college_name a')  # capturing the links to extract
paths.each do |path|
  href = path['href'].to_s
  page = agent.get(href)            # pass the variable, not the string 'path'
  data = page.css('.font11.bold')   # data to extract
  puts data.text                    # data to display
end

The key here is simply to continue the scraping with the Mechanize instance you logged in with, since it already holds the active session (cookies) for the server.
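
Putting both halves together, a minimal end-to-end sketch could look like the one below. The login URL, the form field names ('student_email', 'student_password') and the 'data_path_url' placeholder are taken from the question and need to be replaced with the real values:

require 'mechanize'

agent = Mechanize.new

# log in; the same agent keeps the session cookies afterwards
login = agent.get('website login page')          # placeholder login URL
form  = login.forms.first
form['student_email']    = 'email@gmail.com'
form['student_password'] = 'password'
result = form.submit
puts result.code

# scrape with the same, still-authenticated agent
doc = agent.get('data_path_url')                 # placeholder data URL
doc.css('.college_name a').each do |link|
  page = agent.get(link['href'].to_s)
  puts page.css('.font11.bold').text
end

Note that many sites return 200 even for a failed login, so the surest check is to look on the returned page for something only a logged-in user can see.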

Chidi Ekuma