Finding a word in a web page using Java

Question

I am trying to search for a specific word in a specific web page, using Java and Eclipse. With a web page that has almost no content it works fine, but with a "big" web page it doesn't find the word.

For example, I am trying to find the word ["InitialChatFriendsList" in the web page https://www.facebook.com; if the word is found, the program should print WIN!!!

Here is the full Java code:

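import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class BR4Qustion {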
    public static void main(String[] args) {
        BufferedReader br = null;
        try {
            URL url = new URL("https://www.facebook.com");  
            br = new BufferedReader(new InputStreamReader(url.openStream()));

            String foundWord = "[\"InitialChatFriendsList\"";          
            String sCurrentLine;

            while ((sCurrentLine = br.readLine()) != null) {
                String[] words = sCurrentLine.split(",");
                for (String word : words) {         
                    if (word.equals(foundWord)) {
                        System.out.println("WIN!!!");
                        break;
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null)
                    br.close();
            } catch (IOException ex) {
                System.out.println("*** IOException for URL : ");
            }
        }
    }
}

Why are you splitting the words by `,`? Why do you consider a string with a square bracket and two double quotes to be a "word" anyway? — RealSkeptic, Aug 27 '17 at 13:12
What do you mean by "*big webpages*"? Note that when using this code and visiting *facebook* for example, you are **not logged in** and reading the **starting page**! For debugging purposes you could just print the whole content of the page and check whether that is the content you expect, because the code itself doesn't look wrong at first glance. — Zabuzard, Aug 27 '17 at 13:14
@RealSkeptic Because I know that this word should appear in this web page: `...,["InitialChatFriendsList",...` — , Aug 27 '17 at 13:14
Why not use `if (sCurrentLine.contains(foundWord))` instead of inner loop? — SMA, Aug 27 '17 at 13:14
Note that nowadays you should use **try-with-resources** like `try (BufferedReader br = ...) { ... }` which automatically closes the resource after usage. — Zabuzard, Aug 27 '17 at 13:15
@Zabuza Let's suppose that I'm running this on my own computer and I'm logged in to facebook in my browser — , Aug 27 '17 at 13:19
It is not enough to log in via your browser. That is what I want to tell you. Information regarding login status is saved inside cookies (browser-specific), and Java's `url.openStream()` **does not access** the cookie information of your specific browser. I am pretty sure that your Java code is reading the starting page of facebook, thus the code does not work. Just check the content which is read by your `BufferedReader`; it is probably the starting page. You would need to log in via your Java app or hijack the session; APIs exist for both. — Zabuzard, Aug 27 '17 at 13:21
If you want to parse restricted content (behind a login system), then you should have some idea of sessions — sharif2008, Aug 27 '17 at 13:22
But, when I look at the source of the web page the string `,["InitialChatFriendsList",` is there — , Aug 27 '17 at 13:25
When you say "*look at the source*" you mean from inside your browser where you have logged in? Yes, but "*Java's browser*" is not logged in to your facebook account. To convince yourself of the problem, try to use a second browser and visit facebook. For example, log in with Google Chrome and then visit facebook with Internet Explorer. In IE you are not logged in. Login information is saved inside the browser's local data (cookies) and not shared among browsers. Java does not access this data; you would need to log in via Java itself or hijack the session. — Zabuzard, Aug 27 '17 at 13:27
@Nehoral: Why are you trying to scrape the Facebook page instead of just using the Facebook API? Circumventing the restrictions of the API is usually a TOS violation, and "big" websites have various mechanisms in place to detect automated "users" (we usually consider them "attackers") and serve up different content to them. — Daniel Pryden, Aug 27 '17 at 13:28
@Nehoral: it may be helpful for you to use a tool like Wireshark to compare what your web browser is sending to Facebook (and what it's getting back) with what your program is sending to Facebook and what it's getting back. I guarantee you that you'll find some significant differences there. — Daniel Pryden, Aug 27 '17 at 13:31

Zabuzard · Accepted Answer · 2017-08-27T15:17:55.787

Problem

Besides some small flaws in your code (you should use try-with-resources and the newer I/O library, NIO), it looks fine and does not seem to contain a logical error.

You are facing a different problem here. To read your Facebook content you first need to log in to your account; otherwise you will see the starting page.

I guess you think it is enough to log in from your browser (for example Google Chrome), but that is not the case. Login information is saved in the local storage of the specific browser you used, for example in its cookies. This is called a session.

Showcase

As a small experiment, visit Facebook with Google Chrome and log in. Then visit it with Internet Explorer: you will not be logged in there, and you will see the starting page again.

The same happens with your Java code: you are simply reading the starting page because "Java's browser" is not logged in. You can check this by dumping the content your BufferedReader is reading:

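import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class DumpPage {
    public static void main(final String[] args) throws IOException {
        final URL url = new URL("https://www.facebook.com");
        // try-with-resources closes the reader automatically
        try (final BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()))) {
            // Read the whole page line by line and print it
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}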

Take a look at the output; it will probably be the source of the starting page.
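Instead of inspecting the dump by eye, the check can also be done in code. The following is a minimal sketch that searches each line with String.contains, as suggested in the comments, rather than splitting on commas. The class name WordSearch and the method containsWord are made up for illustration, and the page content is a hard-coded stand-in so that the example runs without network access; in the real case the Reader would wrap url.openStream().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class WordSearch {
    /** Returns true as soon as any line read from the source contains the needle. */
    public static boolean containsWord(final Reader source, final String needle) throws IOException {
        try (final BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.contains(needle)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(final String[] args) throws IOException {
        // Hypothetical stand-in for the page source, not real Facebook output
        final String page = "<script>x,[\"InitialChatFriendsList\",{}]</script>\n<p>other</p>";
        System.out.println(containsWord(new StringReader(page), "[\"InitialChatFriendsList\""));
    }
}
```

If this returns false for the real page, that supports the diagnosis above: the string is simply not in the content that the Java program receives.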

Insights

After logging in to Facebook via my browser, the website sends me several cookies. Among them, the c_user cookie is definitely relevant for the session: if I delete it and refresh the page, I am no longer logged in.

Solution

For your Java code to work, it would need to log in itself, by filling in the form and submitting it (or just by sending the corresponding POST request), then reading Facebook's response and saving all the cookie information. However, doing this yourself would be a huge task, and I would not recommend it. Instead, you could use an API that emulates a browser from inside Java, for example HtmlUnit. Alternatively, you could use a library like Selenium, with which you can control your favorite browser directly via its driver interface.
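To give an idea of what "saving the cookie information" involves, the standard library's java.net.CookieManager can store cookies from response headers and hand them back for later requests. The sketch below exercises it offline by feeding it a hand-written Set-Cookie header. The class name CookieJarSketch, the method roundTrip, the example.com URI, and the cookie value are all placeholders chosen for illustration; this is not Facebook's actual login flow.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CookieJarSketch {
    /**
     * Feeds one Set-Cookie response header into the manager and returns the
     * Cookie request header values the manager would send back to the same URI.
     */
    public static List<String> roundTrip(final CookieManager manager, final URI uri, final String setCookie)
            throws IOException {
        final Map<String, List<String>> responseHeaders = new HashMap<>();
        responseHeaders.put("Set-Cookie", Collections.singletonList(setCookie));
        manager.put(uri, responseHeaders);

        final Map<String, List<String>> noRequestHeaders = new HashMap<>();
        final List<String> cookies = manager.get(uri, noRequestHeaders).get("Cookie");
        return cookies == null ? Collections.<String>emptyList() : cookies;
    }

    public static void main(final String[] args) throws IOException {
        final CookieManager manager = new CookieManager();
        // Placeholder URI and cookie, for illustration only
        final URI uri = URI.create("https://example.com/login");
        System.out.println(roundTrip(manager, uri, "session=abc123; Path=/"));
    }
}
```

In a real program, calling CookieHandler.setDefault(manager) before opening connections lets HttpURLConnection do this bookkeeping automatically. The login POST itself, with its form fields and redirects, is the part that makes the task large.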

The other approach would be to hijack the session: you extract the relevant cookie data from your browser's local files and recreate the same cookies inside your Java application. This is also a huge task for a non-expert without the help of APIs.
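The "recreate the cookie data" step boils down to sending a Cookie request header with the same name/value pairs the browser holds. A minimal sketch, assuming the values have already been obtained somehow: the class name CookieHeaderSketch, both method names, and all cookie values below are hypothetical placeholders. Which cookies a given site actually requires is not covered here.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class CookieHeaderSketch {
    /** Joins name/value pairs into the "name=value; name2=value2" form of a Cookie header. */
    public static String buildCookieHeader(final Map<String, String> cookies) {
        final StringJoiner joiner = new StringJoiner("; ");
        for (final Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    /** Prepares a connection carrying the cookies; nothing is sent until the caller reads from it. */
    public static HttpURLConnection openWithCookies(final URL url, final Map<String, String> cookies)
            throws IOException {
        final HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestProperty("Cookie", buildCookieHeader(cookies));
        return connection;
    }

    public static void main(final String[] args) {
        // Placeholder values; real ones would have to come from the browser's cookie store
        final Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("c_user", "PLACEHOLDER_ID");
        cookies.put("other", "PLACEHOLDER_VALUE");
        System.out.println(buildCookieHeader(cookies));
    }
}
```

Extracting the values from the browser's local files is the hard part and is left out on purpose.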

Remarks

Now, very importantly, note that Facebook (like other websites such as Twitter) has a publicly available API (Facebook for Developers), which is designed to ease interaction with automated software. There are of course also Java API wrappers available, like Facebook4J. So you should just use those APIs when trying to get data from sites like Facebook.

Also note that many sites, including Facebook, state in their Terms of Service (TOS) that interaction via automated software that does not use their API is treated as a violation of those terms. It could result in legal consequences.

An excerpt from the TOS:

Safety

You will not collect users' content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our prior permission.

score 0 · Answer 2 · answered Aug 27 '17 at 15:31

You could try Jsoup.

This library lets you connect to a page, load it, and parse it.

Here is an example

Xavier Bouclet
