
Part 1 : Get Cookies ( working )

Part 2 : Use cookies with another sub-page ( not working )

2.1 https://www.valueresearchonline.com/funds/26123/motilal-oswal-flexi-cap-fund-regular-plan/#fund-portfolio : shows the "Top Holdings" section when the page is in a logged-in state.

I don't understand why the sub-page is not in a logged-in state even though the cookies are passed to it.

Part 1 ( Get cookies )

Connection.Response login2 = Jsoup.connect("https://www.valueresearchonline.com/login/?")
            .timeout(15000)
            .userAgent("Mozilla")
            .data("username", "valid_email")
            .data("password", "valid_password")
            .method(Connection.Method.POST)
            .execute();

System.out.println(login2.statusCode());
System.out.println(login2.cookies());
Document doc = login2.parse();

System.out.println(doc.body().text().indexOf("My Favourite Stories"));
System.out.println(doc.body().text().indexOf("Logout"));

String sessionId2 = login2.cookie("PHPSESSID");

Chrome Dev Tool Network Tab Output

 curl 'https://www.valueresearchonline.com/login/?' \
            -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
            -H 'Accept-Language: en' \
            -H 'Cache-Control: max-age=0' \
            -H 'Connection: keep-alive' \
            -H 'Cookie: PHPSESSID=pklb54pa5chma4hi69bgfu7vcc; currency=INR; magnitude=LC; ad=53c5b4ffbcbd345c755abe149e639d10aa8fdb70; ad=53c5b4ffbcbd345c755abe149e639d10aa8fdb70; wec=295393018; nobtlgn=368510251; ac=67889375%7C379669468%7C35351482; ac=67889375%7C379669468%7C35351482; _gcl_au=1.1.265956817.1663225155; _gid=GA1.2.417218768.1663225156; _gat_UA-240759-1=1; _clck=5maw1g|1|f4w|0; _fbp=fb.1.1663225156547.1723305957; __gads=ID=456fb6dced89a9d4-22af789890d6008a:T=1663225157:S=ALNI_MYgbR1A0Fh3kAwwVwrYjWKMDAYOuA; __gpi=UID=000009c87e7a6399:T=1663225157:RT=1663225157:S=ALNI_MZ4wVT9remS6N-MFKJvbFF1GCqtRg; __cf_bm=XMBuGU25ky6q.8Z_vTFVtdWF.EWPPsrG8Buy1QFgIl4-1663225158-0-AW29ht0HH+iYPhBRF4AU3bmUim5cGJvhNtZuM41NVaC0kWPvnRr4/+1v+n+0Q8iA6SxKD8m9lScYnM8T/HfGonbEoOz84uh83Y7d98O4qe/mVT8Ixv4yya4ZWhzazxOboQ==; _ga=GA1.1.448580854.1663225156; _ga_N9R425YFBJ=GS1.1.1663225155.1.1.1663225167.48.0.0; _clsk=15nbblb|1663225167506|3|1|l.clarity.ms/collect; pgv=4' \
            -H 'Referer: https://www.valueresearchonline.com/register' \
            -H 'Sec-Fetch-Dest: document' \
            -H 'Sec-Fetch-Mode: navigate' \
            -H 'Sec-Fetch-Site: same-origin' \
            -H 'Sec-Fetch-User: ?1' \
            -H 'Upgrade-Insecure-Requests: 1' \
            -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36' \
            -H 'sec-ch-ua: "Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"' \
            -H 'sec-ch-ua-mobile: ?0' \
            -H 'sec-ch-ua-platform: "macOS"' \
            --compressed

Part 2 - Not working ( Page is NOT in logged in state )

//Post Request - Not working

login2 = Jsoup.connect("https://www.valueresearchonline.com/funds/26123/motilal-oswal-flexi-cap-fund-regular-plan/#fund-portfolio")
                .timeout(15000)
                .userAgent("Mozilla")
                .cookie("PHPSESSID", sessionId2)
                .cookies(login2.cookies())
                .method(Connection.Method.POST)
                .execute();

System.out.println(login2.statusCode());
doc = login2.parse();
System.out.println(doc);

//Get Request - Not working

doc = Jsoup.connect("https://www.valueresearchonline.com/funds/26123/motilal-oswal-flexi-cap-fund-regular-plan/#fund-portfolio")
            .userAgent("Mozilla")
            .timeout(15000)
            .cookies(login2.cookies())
            .get();

Chrome Dev Tool Network Tab Output

curl 'https://www.valueresearchonline.com/fund-details/26123/?tab=fund-portfolio' \
      -H 'Accept: application/json, text/javascript, */*; q=0.01' \
      -H 'Accept-Language: en-US,en;q=0.9' \
      -H 'Connection: keep-alive' \
      -H 'Cookie: currency=INR; magnitude=LC; ad=78991d1d28c094ebf1f39eb89bdeba08fa7442fb; ad=78991d1d28c094ebf1f39eb89bdeba08fa7442fb; wec=295383799; nobtlgn=789939331; ac=67886089%7C246429140%7C430534802; ac=67886089%7C246429140%7C430534802; _gcl_au=1.1.695297406.1663222951; _gid=GA1.2.274867828.1663222952; _clck=qcxo6j|1|f4w|0; __cf_bm=bzlNVWAtaJiSxUfJ75njw.Zjxxhm_6NdpHRAnt_yZME-1663222953-0-AZ+JMB1vgmANxPS0dbOP5fijqdwMV2dO8gcChGvTkmBsdjKzFC0dMTF8H7zJFtDVwy16hjeygZ224SUimQNMxPmNjen+nfhLNp9v9dHjxMy/ezpdYYa1rYd+7JGe4RS/lA==; alp=VROL; PERMA-ALERT=0; g_state={"i_t":1663309591263,"i_l":0}; PHPSESSID=adcn3ck7fuinlliqnmqco9d4ep; shop-beta=ee1e0e7e3a3617e78e0827d43a83398fd12221b2; aa=364476%7C372053540%7C953882152; aa=364476%7C372053540%7C953882152; arl=801870920; arl=801870920; _clsk=11pwsr9|1663223225068|10|1|l.clarity.ms/collect; _gat_UA-240759-1=1; pgv=17; _ga_N9R425YFBJ=GS1.1.1663222951.1.1.1663223343.60.0.0; _ga=GA1.1.1796800378.1663222952' \
      -H 'Referer: https://www.valueresearchonline.com/funds/26123/motilal-oswal-flexi-cap-fund-regular-plan/' \
      -H 'Sec-Fetch-Dest: empty' \
      -H 'Sec-Fetch-Mode: cors' \
      -H 'Sec-Fetch-Site: same-origin' \
      -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36' \
      -H 'X-Requested-With: XMLHttpRequest' \
      -H 'sec-ch-ua: "Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"' \
      -H 'sec-ch-ua-mobile: ?0' \
      -H 'sec-ch-ua-platform: "macOS"' \
      --compressed
  

Solutions Tried

  1. jsoup posting and cookie

  2. Jsoup Cookies for HTTPS scraping

  3. Login to a website using Jsoup and stay on the site

I get cookies after login, but when I use them to fetch another URL on the same website, the page is always in a "not logged in" state.

vikramvi
  • Are you sure the login is successful? – Olivier Sep 05 '22 at 07:09
  • Yes, as https://www.valueresearchonline.com/ has got "Logout" seen in "Document" from "Part 1 / Step 1" – vikramvi Sep 05 '22 at 07:18
  • is it ok to share dummy email and pwd here so that you can check please ? – vikramvi Sep 05 '22 at 11:28
  • @Olivier can you please clarify per update above to your question ? – vikramvi Sep 08 '22 at 12:25
  • Try to debug the response body() of the login, to assure it's really logged in, not showing any errors or something (sometimes they don't support certain states). And also debug the cookies. And show us the results – DiLDoST Sep 10 '22 at 01:41
  • can you please clarify how to "debug" cookies and response body() of the login ? – vikramvi Sep 10 '22 at 08:27
  • attach a debugger, set a breakpoint and inspect the value of that variable(s) – cyberbrain Sep 10 '22 at 21:29
  • Maybe [this](https://stackoverflow.com/questions/34468944/how-to-login-with-https-self-signed-certified-in-jsoup/34469830#34469830) or [this](https://stackoverflow.com/questions/31871801/problems-submitting-a-login-form-with-jsoup/31877829#31877829) can help you. – TDG Sep 13 '22 at 13:56
  • @TDG I tried both the links but it didn't work. Can you please check "Part 2", should I use Get or Post there ? I tried both method and it didn't work as well. In browser I first login and directly paste that url; it works but how to simulate this using JSOUP ? – vikramvi Sep 14 '22 at 07:47
  • @TDG .cookie("PHPSESSID", sessionId2) .cookies(login2.cookies()) , is this correct ? Should I use cookie / cookies ? – vikramvi Sep 14 '22 at 07:58
  • Before jumping to part 2, start from the beginning - open the browser's dev tools, login to the page and load the page you want. Check the network tab and see all the fields that are included in the get/post request, not just the username and password. Check also the request headers - some servers check that all the headers, such as useragent, are also included. If you see some other values in the request, besides the username and password, clear the cache of the browser, load again the first page and see if you can find these values there. – TDG Sep 14 '22 at 15:00
  • I've edited question and added both cURL calls from network tab. I'm doubting if I'm passing cookie value properly in 2nd call. Is there a way to pass it as "string" ? https://stackoverflow.com/questions/73713898/how-to-pass-cookie-value-as-string-instead-of-mapstring-string-in-jsoup – vikramvi Sep 15 '22 at 06:47
  • @TDG I've updated questions with cURL, can you please have a look and clarify ? I'm blocked because of this issue since last 2+ weeks. Thanks in advance. – vikramvi Sep 16 '22 at 05:51
  • When I try to login (with dummy user, since I don't have an account) I see the following fields in the post request - `{"username": "myusername", "password": "123456", "provider": "VROL", "site-code": "VROL", "order-id": "", "token": "", "target": "/register", "url-hash": ""}`. As I wrote above - you have to start from the first page in your browser, and record all the requests and responses as you follow the login procedure, and then you can use jsoup to do the same. There are no shortcuts. – TDG Sep 16 '22 at 07:23
  • just curious, if this approach succeeded? – cilap Dec 07 '22 at 07:11
  • No, this didn't work in this particular case. – vikramvi Dec 08 '22 at 10:35
  • as written, will be hard since it looks like there is js based execution behind (see also my answer) – cilap Dec 30 '22 at 08:24
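
The two suggestions in the comments above (send every form field the browser sends, and pass the cookies as a single string) can be sketched with plain Java as follows. This is only an illustration: the field names and values are copied from TDG's comment, while the class name `LoginRequestSketch`, both method names, and the sample cookie values are hypothetical. The asker reported that this route did not work in this particular case, so treat it as a starting point for debugging rather than a fix.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper class; none of these names come from jsoup or the site.
final class LoginRequestSketch {

    // Builds the login form fields listed in the comment above.
    // The fixed values ("VROL", "/register", empty strings) are copied
    // from that comment and may differ for a real session.
    static Map<String, String> loginFormFields(String username, String password) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", username);
        fields.put("password", password);
        fields.put("provider", "VROL");
        fields.put("site-code", "VROL");
        fields.put("order-id", "");
        fields.put("token", "");
        fields.put("target", "/register");
        fields.put("url-hash", "");
        return fields;
    }

    // Joins a cookie map into the "name=value; name2=value2" form
    // that browsers send in the Cookie request header.
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Sample values only; a real run would use the cookies from the login response.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("PHPSESSID", "abc123");
        cookies.put("currency", "INR");
        System.out.println(toCookieHeader(cookies));
        System.out.println(loginFormFields("valid_email", "valid_password").keySet());
    }
}
```

With jsoup, the field map could be handed to `Connection.data(Map<String, String>)` and the joined string to `Connection.header("Cookie", ...)`. Passing the map to `Connection.cookies(...)` should produce an equivalent header, so the string form mainly helps when comparing against the cURL capture.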

1 Answer


It looks to me like you want to build a web scraper. Circumventing this may be a legal infringement, depending on your country, or possibly your client's country.

If it is allowed, consider a WebDriver-based approach, which uses a real browser and so avoids the issue. jsoup cannot execute the JavaScript that is needed to make your call work. Use WebDriver or HtmlUnit (less likely to succeed) to get this working.
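
If you stay with jsoup, the client-side behaviour has to be reproduced by hand. A minimal sketch of the URL part, assuming the network capture in the question is representative: the `#fund-portfolio` fragment is handled inside the browser and is never sent to the server, and the tab content is requested separately from `/fund-details/26123/?tab=fund-portfolio`. The class and method names below are hypothetical, and the assumption that the endpoint pattern holds for other fund ids or tabs is not confirmed anywhere in this thread.

```java
// Hypothetical helper class; the URL shapes are copied from the question's captures.
final class FundTabUrlSketch {

    static final String BASE = "https://www.valueresearchonline.com";

    // Drops the "#..." part, which a browser keeps to itself and
    // does not transmit in the HTTP request.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Builds the XHR endpoint seen in the second cURL capture.
    // Assumption: other fund ids and tab names follow the same pattern.
    static String tabEndpoint(int fundId, String tab) {
        return BASE + "/fund-details/" + fundId + "/?tab=" + tab;
    }

    public static void main(String[] args) {
        String page = BASE + "/funds/26123/motilal-oswal-flexi-cap-fund-regular-plan/#fund-portfolio";
        System.out.println(withoutFragment(page));
        System.out.println(tabEndpoint(26123, "fund-portfolio"));
    }
}
```

A jsoup request to that endpoint would presumably also need the session cookies, the `X-Requested-With: XMLHttpRequest` header (via `Connection.header`), the fund page as referrer (via `Connection.referrer`), and `ignoreContentType(true)`, because the capture's Accept header asks for JSON. Whether the server then returns the logged-in data has not been verified here.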

cilap
  • Why did this bounty awarded to completely wrong answer ? I know about webdriver approach but the question was about "jsoup" specifically. SO moderators should look into "bounty award" feature as it's getting awarded to wrong people – vikramvi Sep 12 '22 at 11:26
  • ohh what a nice comment to someone who want to help. Anyway I have updated my answer. And if you want still jsoup, you have to reimplement the client side code too. Good luck on this approach – cilap Sep 13 '22 at 05:13