Problem: I have searched several websites and blogs for a solution but have not found what I was looking for. In short, I would like to scrape a site, but to reach it I first have to get past a login page.

What I did: I managed to open the page with urllib2 and httplib, but even after logging in (no errors are displayed), the redirect that follows the login in a browser does not happen. My code was not very different from what is shown here: How to use Python to login to a webpage and retrieve cookies for later usage?; except that I did not use cookies.

What am I looking for? I am not entirely sure which fields I need besides the "username" and "password" fields. I would like the script to: 1) Log in to the .aspx site successfully and display a message confirming that the login worked. 2) Redirect to another page after logging in, so that I can scrape the data from the site. 3) I would also like to know how to gather any site's POST/GET fields, so that I know I am passing the right parameters.

Any assistance or advice would be much appreciated.
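For point 3, one way to see which fields a login form expects is to parse the page and list every input it contains, including the hidden ones. The following is only a minimal sketch under stated assumptions: it is written against Python 3's standard library (under Python 2.7 the parser lives in the `HTMLParser` module instead), the sample HTML and its hidden-field values are made up for illustration, and `FormFieldCollector` and `collect_fields` are hypothetical helper names. Only the two `lcLogin$...` names come from the code I tried.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the name/value pair of every <input> element on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            # Inputs without a value attribute are stored as empty strings.
            self.fields[name] = attr.get("value") or ""


# Made-up stand-in for the real login page; the hidden values are invented.
SAMPLE_LOGIN_HTML = """
<form method="post" action="Login.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbC123" />
  <input type="text" name="lcLogin$tbUsername" />
  <input type="password" name="lcLogin$tbPassword" />
</form>
"""


def collect_fields(html):
    """Return a dict of input names to their default values."""
    parser = FormFieldCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


fields = collect_fields(SAMPLE_LOGIN_HTML)
for name, value in sorted(fields.items()):
    print("%s = %r" % (name, value))
```

Against the real site, the HTML would come from a GET of the login page made with the same cookie-aware opener that later sends the POST, so that the hidden values and the session cookie belong together.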

Adil
  • Use developer tools such as Webdeveloper or Firebug for Firefox to capture the form data that is actually sent, then replicate the same data. And of course, do use and store cookies ;-). Without them the next web page has no idea that you were granted access. – Fenikso Apr 15 '12 at 17:31
  • httplib2 can handle cookies and the related work for you. You may want to try it: http://code.google.com/p/httplib2/ – yo_man Apr 15 '12 at 17:38
  • Thank you for the responses. I would like to use httplib2, but I am not entirely sure whether it works with Python 2.7.x. I shall try it and let you all know. In the meantime, I was wondering what exactly I need to look for besides the username and password fields when dealing with a login page. I am dealing with a .aspx page, and I was told that there are many 'hidden' fields. When I use Firebug, a lot of 'stuff' appears. Which parts should I be concerned with? – Adil Apr 15 '12 at 18:21
  • The example you linked is fine for this purpose, but it will not work without cookies. First enable cookies, and if it still does not work, start adding the hidden fields. I usually guess which ones are important. – Fenikso Apr 15 '12 at 18:55
  • Fenikso, I did enable cookies, and it still renders the same login page. Here is the code for reference:
      import urllib, urllib2, cookielib
      username = 'name'
      password = 'pass'
      cj = cookielib.CookieJar()
      opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
      login_data = urllib.urlencode({'lcLogin$tbUsername' : username, 'lcLogin$tbPassword' : password, 'lcLogin$lbtnLogin' : 'Enter'})
      opener.open('https://website.net/Dashboard/Login.aspx', login_data)
      resp = opener.open('https://website.net/Dashboard/Customer_Monitor.aspx')
      print resp.read()
    It does not show any errors. Ideas? – Adil Apr 16 '12 at 14:37
  • The login page contains Javascript and I think that may be causing issues. Following is the code: * – Adil Apr 16 '12 at 18:31
  • So, in short, I am dealing with a login site that is .aspx, i.e.: 1) it relies on hidden fields such as __VIEWSTATE, 2) the submit button uses JavaScript, 3) Firebug shows that there are 2 redirects, and 4) automatic login does not seem to work, because after logging in it still displays the login page :( Please advise – Adil Apr 17 '12 at 00:49
  • So, I guess no one has any leads on this situation? – Adil Apr 23 '12 at 17:59
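The points raised in the comments (cookies, hidden fields, a JavaScript submit button) can be combined into one possible shape for the login POST. This is a hedged sketch rather than a confirmed fix for this site. It assumes the button is an ASP.NET LinkButton that submits through `__doPostBack`, which places the control's name in the hidden `__EVENTTARGET` field. The `lbtn` prefix only hints at that. It is written for Python 3's standard library (the Python 2.7 equivalents are `urllib.urlencode`, `urllib2` and `cookielib`), the hidden-field values are made up, and `build_postback` and `make_opener` are hypothetical helper names.

```python
import urllib.request
from http.cookiejar import CookieJar
from urllib.parse import urlencode


def make_opener():
    """Build an opener that stores and resends cookies between requests."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))


def build_postback(hidden_fields, credentials, event_target):
    """Assemble the POST body that a __doPostBack submit would send."""
    payload = dict(hidden_fields)        # __VIEWSTATE and similar, echoed back as-is
    payload.update(credentials)          # the visible username/password inputs
    payload["__EVENTTARGET"] = event_target   # control that triggered the postback
    payload.setdefault("__EVENTARGUMENT", "")
    return urlencode(payload).encode("ascii")


# Made-up hidden values; the real ones must be read from a fresh GET of the
# login page made with the same opener that sends the POST.
hidden = {"__VIEWSTATE": "dDwtMTIz", "__EVENTVALIDATION": "AbC123"}
creds = {"lcLogin$tbUsername": "name", "lcLogin$tbPassword": "pass"}

body = build_postback(hidden, creds, "lcLogin$lbtnLogin")
opener = make_opener()
print(body.decode("ascii"))

# Not executed here because it needs the real site:
# opener.open("https://website.net/Dashboard/Login.aspx", body)
```

One rough way to check whether the login worked is to look at the final URL of the response (`geturl()`), or to test whether the returned HTML still contains the login form fields.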

0 Answers