I'm trying to do some web scraping from a simple form in C#.

My problem is working out which URL to post to and what the POST parameters should be.

The form I am trying to submit has:

<form method="post" action="./"

As the page sits at www.foobar.com I am creating a WebRequest object in my C# code and posting to this address.

The other issue is that I am not sure of the POST parameter names, as the inputs have long generated names and ids rather than simple ones:

<input name="ctl00$MainContent$txtSearchName" type="text" maxlength="8" id="MainContent_txtSearchName" class="input-large input-upper">

I read "c# - programmatically form fill and submit login", among other posts, and my code now looks like this:

        // Requires: using System.IO; using System.Net; using System.Text;
        var httpRequest = WebRequest.Create("https://www.foobar.com/");
        var values = "SearchName=Foo&SearchLastName=Bar";

        byte[] send = Encoding.Default.GetBytes(values);
        httpRequest.Method = "POST";
        httpRequest.ContentType = "application/x-www-form-urlencoded";
        httpRequest.ContentLength = send.Length;

        // Write the form body; disposing the stream flushes and closes it.
        using (Stream sout = httpRequest.GetRequestStream())
        {
            sout.Write(send, 0, send.Length);
        }

        // Read the whole response body as text.
        string returnvalue;
        using (WebResponse res = httpRequest.GetResponse())
        using (StreamReader sr = new StreamReader(res.GetResponseStream()))
        {
            returnvalue = sr.ReadToEnd();
        }

        File.WriteAllText(@"C:\src\test.html", returnvalue);

However, the resulting HTML page does not show the search results; it shows the initial search form.

I am assuming the post is failing. My questions are about the POST I am making.

Does action="./" mean it posts back to the same page?

Do I need to submit all the form values (or can I get away with only submitting one or two)?

Is there any way to infer what the correct post parameter names are from the form?

Or am I missing something completely about web scraping and submitting forms in server side code?

Jamadan
    Action = "./" refers to the default page in the current folder. So if the page is "www.foobar.com/search.html" the post will go to "www.foobar.com/". If the page is "www.foobar.com/search/index.html" the post will go to "www.foobar.com/search/". – Jack A. Mar 09 '16 at 20:03
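The resolution rule described in the comment above can be checked with the `System.Uri` constructor that takes a base URI and a relative reference. This is only a small sketch: the class name `ActionResolver` and the `search.html` / `search/index.html` paths are illustrative (the paths are taken from the comment, not from the real site).

```csharp
using System;

public static class ActionResolver
{
    // Resolves a form's action attribute against the URL of the page that contains it.
    public static Uri Resolve(string pageUrl, string action)
    {
        return new Uri(new Uri(pageUrl), action);
    }

    public static void Main()
    {
        // "./" keeps the directory part of the page URL and drops the file name.
        Console.WriteLine(Resolve("https://www.foobar.com/search.html", "./"));
        // prints https://www.foobar.com/
        Console.WriteLine(Resolve("https://www.foobar.com/search/index.html", "./"));
        // prints https://www.foobar.com/search/
    }
}
```

So for a page served from the site root, posting to the root URL is consistent with `action="./"`.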

1 Answer

What I would suggest is not doing all of this work manually, but letting your computer take a bit of the workload. You can use a tool such as Fiddler and the Fiddler Request To Code Plugin in order to programmatically generate the C# code for duplicating the web request. You can then modify it to take whatever dynamic input you may need.

If this isn't the route you'd like to take, you should make sure that you are sending the request with the correct cookies (if applicable) and that you are supplying ALL POST data, no matter how trivial it may seem.
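As a rough sketch of the manual route, the code below fetches the form page first, collects every `<input>` that has a `name` attribute (the `name`, not the `id`, is what gets posted), overrides the search field, and posts everything back with the same cookies. Several things here are assumptions rather than facts about the site: the `ctl00$...` name suggests an ASP.NET WebForms page, which usually also carries hidden inputs such as `__VIEWSTATE`, but that has not been confirmed; the class and method names are hypothetical; and the regex-based parsing is a shortcut that a real HTML parser would handle more reliably.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class FormPostSketch
{
    // Collects name/value pairs from every <input> tag that has a name attribute.
    // Note: a browser would skip unchecked checkboxes and unclicked submit buttons;
    // this sketch does not make that distinction.
    public static Dictionary<string, string> ReadInputs(string html)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match tag in Regex.Matches(html, @"<input\b[^>]*>", RegexOptions.IgnoreCase))
        {
            var name = Regex.Match(tag.Value, @"\sname=""([^""]*)""", RegexOptions.IgnoreCase);
            if (!name.Success) continue; // inputs without a name are never submitted

            var value = Regex.Match(tag.Value, @"\svalue=""([^""]*)""", RegexOptions.IgnoreCase);
            fields[WebUtility.HtmlDecode(name.Groups[1].Value)] =
                value.Success ? WebUtility.HtmlDecode(value.Groups[1].Value) : "";
        }
        return fields;
    }

    // GETs the form, then POSTs all of its fields back with one value replaced.
    public static async Task<string> SearchAsync(string pageUrl, string searchName)
    {
        // The cookie container carries any session cookie from the GET to the POST.
        var handler = new HttpClientHandler { CookieContainer = new CookieContainer() };
        using (var client = new HttpClient(handler))
        {
            string formHtml = await client.GetStringAsync(pageUrl);

            var fields = ReadInputs(formHtml);
            // Field name copied from the input element shown in the question.
            fields["ctl00$MainContent$txtSearchName"] = searchName;

            // FormUrlEncodedContent URL-encodes names and values (e.g. "$" becomes "%24").
            var response = await client.PostAsync(pageUrl, new FormUrlEncodedContent(fields));
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

A call such as `await FormPostSketch.SearchAsync("https://www.foobar.com/", "Foo")` would return the response HTML. Comparing the fields it sends against the form data the browser sends (as captured in Fiddler or the browser's network tab) is a quick way to spot anything still missing.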

Patrick Bell
  • Good suggestion to use something like Fiddler to help debug the issue. I would disagree with your assertion that all post data is required. That really depends on the code that receives the post request. On the other hand, it is a good idea to start with everything to reduce the likelihood of failure. – Jack A. Mar 09 '16 at 20:01
  • I'm aware that not all POST data is always required, but when debugging, it's usually nice to have it all there, just so you are certain that isn't the point of error. :) – Patrick Bell Mar 09 '16 at 20:04
  • @ext0 - thanks for the suggestion. I'd love to be able to submit with all post data, I'm just not sure what the param names are to submit the entire data. Must I install Fiddler or is there a chrome plugin or similar that would show the post data that I could replicate? (semi-rhetorical, I'm searching now, just fishing for suggestions) – Jamadan Mar 10 '16 at 18:03
  • Wait, I can see the form data in network in dev tools for Chrome. I think that may work. But will resort to Fiddler if not. Thanks – Jamadan Mar 10 '16 at 18:17