Scraping a site with python3.6. I can't progress past the login page

Question

The html form code for the site:

                <form class="m-t" role="form" method="POST" action="">

                <div class="form-group text-left">
                    <label for="username">Username:</label>
                    <input type="text" class="form-control" id="username" name="username" placeholder="" autocomplete="off" required />
                </div>
                <div class="form-group text-left">
                    <label for="password">Password:</label>
                    <input type="password" class="form-control" id="pass" name="pass" placeholder="" autocomplete="off" required />
                </div>

                <input type="hidden" name="token" value="/bGbw4NKFT+Yk11t1bgXYg48G68oUeXcb9N4rQ6cEzE=">
                <button type="submit" name="submit" class="btn btn-primary block full-width m-b">Login</button>

Simple enough so far. I've scraped a number of sites in the past without issue.

I have tried: selenium, mechanize(albeit had to drop back to earlier version of python), mechanicalsoup, requests.

I have read: multiple posts here on SO as well as: https://kazuar.github.io/scraping-tutorial/ http://docs.python-requests.org/en/latest/user/advanced/#session-objects and many many more.

Sample code:

import requests
from lxml import html
session_requests = requests.session()
result = session_requests.get(url)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='token']/@value")))[0]
result = session_requests.post(
    url, 
    data = payload, 
    headers = dict(referer=url)
)
result = session_requests.get(url3)
print(result.text)

and

import mechanicalsoup
import requests
from http import cookiejar

c = cookiejar.CookieJar()
s = requests.Session()
s.cookies = c
browser = mechanicalsoup.Browser(session=s)

login_page = browser.get(url)

login_form = login_page.soup.find('form', {'method':'POST'})

login_form.find('input', {'name': 'username'})['value'] = username
login_form.find('input', {'name': 'pass'})['value'] = password

response = browser.submit(login_form, login_page.url)

Try as I might I just cannot return anything other than the html code for the login page and I don't know where to explore next to actually figure out what's not happening and why.

url = variable that holds login page url, url3 = a page I want to scrape.

Any help would be much appreciated!

You might want to use `fiddler` to capture all traffic while logging in and find out what happens behind the scene, then simulate that process just like your first example does, debug with `127.0.0.1:8888` and compare your requests with the actual login requests until you get the correct response from server. — Shane, Jan 24 '17 at 11:21
Thanks for the response Shane. I've never come across fiddler before, can you provide a link please? Is it a python module or other program? — Oceanic_Panda, Jan 24 '17 at 11:36
If I'm not mistaken http://docs.telerik.com/fiddler would be it? I don't have admin access on this work machine so that'll need to be a backup for when I get home. — Oceanic_Panda, Jan 24 '17 at 11:39

Ujjaval Moradiya · Accepted Answer · 2017-01-25T12:07:20.027

1

Did you tried headers?

First try on the browser and observe what headers are required and send headers in the requests. Headers are important part to identify user or client.

Try from the different IP, may be someone is watching the requested Ip.

Try this example. Here I am using selenium and chrome driver. First I am getting cookie from selenium and I am saving that in a file for later purpose and then I am using requests with the saved cookie to access pages which requires login.

from selenium import webdriver
import os
import demjson

# download chromedriver from given location and put at some accessible location and set path
# utl to download chrome driver - https://chromedriver.storage.googleapis.com/index.html?path=2.27/
chrompathforselenium = "/path/chromedriver"

os.environ["webdriver.chrome.driver"]=chrompathforselenium
driver=webdriver.Chrome(executable_path=chrompathforselenium)
driver.set_window_size(1120, 550)

driver.get(url1)

driver.find_element_by_name("username").send_keys(username)
driver.find_element_by_name("pass").send_keys(password)

# you need to find how to access button on the basis of class attribute
# here I am doing on the basis of ID
driver.find_element_by_id("btnid").click()

# set your accessible cookiepath here.
cookiepath = ""

cookies=driver.get_cookies()
getCookies=open(cookiepath, "w+")
getCookies.write(demjson.encode(cookies))
getCookies.close()

readCookie = open(cookiepath, 'r')
cookieString = readCookie.read()
cookie = demjson.decode(cookieString)

headers = {}
# write all the headers
headers.update({"key":"value"})

response = requests.get(url3, headers=headers, cookies=cookie)
# check your response

edited Jan 25 '17 at 12:07

answered Jan 24 '17 at 12:37

Ujjaval Moradiya

222
1
12

Thanks for the reply Bonny. I now have HTTP live headers. How do I know which part to include in the header? I see from here: http://stackoverflow.com/questions/6260457/using-headers-with-the-python-requests-librarys-get-method I would add Content-Type: text/html; charset=utf-8, is this correct? I doubt it's an IP issue since I'm working on my company website from an internal computer. – Oceanic_Panda Jan 24 '17 at 13:40
It depends to whom you are dealing with and what kind of request you are making. As an example, for the sites like banking, they will watch everything like user agent, content type, referer. If it is some kind of api than may be authorized parameters are needed. So it depends what they want. Try to send every headers that you are getting from on the browser. – Ujjaval Moradiya Jan 24 '17 at 13:55
So it may be worth mentioning at this stage that I have access to the person who made the site. They've been learning php while they make it. So if there are questions that I can ask rather than trial and error testing then that's an option.I have tried discussing with them what I'm trying to achieve and they haven't been able to help due to no experience in web scraping. – Oceanic_Panda Jan 24 '17 at 14:22
Can you write the headers here that you are getting from the browser and the login page url? Please make sure that you are making correct request. After the first url, (i.e. url = variable that holds login page url), are there any other url required? If yes, then you also have to call that url. Cause many websites are generating tokens, random numbers on the front end after login and they are sending back for verification purpose. – Ujjaval Moradiya Jan 25 '17 at 06:53
I went to the page and logged in as I normally would from Firefox, this is what I got: http://pastebin.com/VikRnvSB – Oceanic_Panda Jan 25 '17 at 08:30
ok. Try sending headers - host, User-Agent, Accept, Accept-Language, Accept-Encoding, Connection, Upgrade-Insecure-Requests. I want to know one thing, headers for this written 3 times - "http://*****/intranet". which one is the first ? – Ujjaval Moradiya Jan 25 '17 at 10:33
The oldest appears at the top – Oceanic_Panda Jan 25 '17 at 11:32
Still only receive then login page – Oceanic_Panda Jan 25 '17 at 11:56
please check my post, I added code. Try and let me know. – Ujjaval Moradiya Jan 25 '17 at 12:07
Ah I see why I'm having those errors... chrome does actually also need to be installed..... my sys admin only allows firefox. I'm trying to figure out how to configure the firefoxdriver since I always get errors there. I'm on windows 7 btw. – Oceanic_Panda Jan 25 '17 at 15:35
Mine is ubuntu. You can also do the same in firefox. Just find how to do that. – Ujjaval Moradiya Jan 26 '17 at 04:22
alrighty! I've played around a bit and I'm now at the following error: cookiejar.set_cookie(create_cookie(name, cookie_dict[name])) TypeError: list indices must be integers or slices, not dict – Oceanic_Panda Jan 30 '17 at 11:58
How cookiejar came into the picture? Selenium generates cookie in dictionary format only. So there should not be the issue. Which browser are you using? – Ujjaval Moradiya Jan 31 '17 at 07:05
http://pastebin.com/RsmQ9WVv here's my code as it stands, I had to make a couple of adjustments to the code you supplied to get to this stage. I've played around with having headers or not and haven't seen any difference yet though to be fair both get to the same stage in the errors. – Oceanic_Panda Jan 31 '17 at 08:42
This is the error: Traceback: File "C:\...test.py", line 35, in response = requests.get(url2, cookies=cookie) File "C:\...\api.py", line 70, in get return request('get', url, params=params, **kwargs) File "C:\...\api.py", line 56, in request return session.request(method=method, url=url, **kwargs) File "...\sessions.py", line 474, in request prep = self.prepare_request(req) File "...\sessions.py", line 385, in prepare_request cookies = cookiejar_from_dict(cookies) File "...\cookies.py", line 518, in cookiejar_from_dict cookiejar.... – Oceanic_Panda Jan 31 '17 at 09:07
I can't help this way. I need to look what you want to scrap and I think its not possible here. If you can do then let me know. – Ujjaval Moradiya Jan 31 '17 at 11:49
OK, thanks for all your help! It's all on the company intranet so no way I can give you access to see unfortunately. – Oceanic_Panda Jan 31 '17 at 11:53
Code now works! I had to comment out everything between driver.find_element_by_name("btnid").click() and requests.get(url2). I then started a requests session and added for loop on driver.get_cookies(): and did a session cookies update. Thanks for the help. I will add a separate answer for any future askers of the same question. – Oceanic_Panda Feb 01 '17 at 09:59
yes its good. I stored cookies in a file. In your case after half an hour if you want to do something you need to go from login page and if cookie is stored in a file you can read cookie from file and without login you can perform your tasks. – Ujjaval Moradiya Feb 01 '17 at 12:20

score 1 · Answer 2 · answered Feb 01 '17 at 10:03

This is the code that ended up working:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import demjson
import requests
capabilities = DesiredCapabilities.FIREFOX.copy()
import os
os.chdir('C:\\...') #chdir to the dir with geckodriver.exe in it
driver = webdriver.Firefox(capabilities=capabilities, firefox_binary='C:\\Program Files\\Mozilla Firefox\\firefox.exe')
username = '...'
password = '...'
url = 'https://.../login.php' #login url
url2 = '...' #1st page you want to scrape

driver.get(url)
driver.find_element_by_name("usr").send_keys(username)
driver.find_element_by_name("pwd").send_keys(password)

driver.find_element_by_name("btn_id").click()

s = requests.session()
for cookie in driver.get_cookies():
    c = {cookie['name']: cookie['value']}
    s.cookies.update(c)


response = s.get(url2)

Scraping a site with python3.6. I can't progress past the login page

2 Answers2

Linked