Looking for the most suitable crate that crawls through specific parts of the web

Hey guys, I am looking for a crate that crawls through a specific site, including password-protected sites such as Gmail. The idea is that I can supply my username and password so that it can go through my Gmail, and I can program it to do whatever I want, such as sending email.

Is there a crate like this?

I believe there are some basic utilities for crawling, but I strongly doubt there's anything that knows how to log in to specific services.

Also, for Gmail and other Google products, note that Google takes specific measures to prevent bots from navigating their interfaces. Such a bot could stop working at any time. Have you considered using the Gmail API?

Right, I see, mate.

I could, but what about Outlook? Does Microsoft stop such bot activity too?

For Outlook, you could use the Outlook REST API.

Ah, right. What if I wanted to make my own bot/web crawler? How would I get started?

Really, there are a few different steps for a bot in the general case.

  1. Determine which API calls the web interface makes to the server.
  2. Determine how the API gets its authentication data. Usually this is from a cookie stored when the user logs in.
  3. Use your favorite HTTP library (e.g., reqwest) to repeat those API calls. You may need to set the User-Agent to imitate a browser. This may be against the Terms of Service of the website.
  4. If you need to actually scrape data from a web interface, analyze where in the HTML tree the data is stored. Then, use an HTML-parsing crate such as scraper or kuchiki to extract that from the response text.
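To make steps 2 and 3 a bit more concrete, here is a rough, standard-library-only sketch of how stored login cookies end up in a request. The helper name `cookie_header`, the cookie names and values, and the User-Agent string are all made up for illustration; a real site will use its own cookie names, which you would read from the browser's developer tools after logging in. With an HTTP library such as reqwest, you would attach the resulting pairs as request headers rather than printing them.

```rust
/// Hypothetical helper: joins stored cookies into the value of a `Cookie`
/// request header, which is a list of `name=value` pairs separated by "; ".
fn cookie_header(cookies: &[(&str, &str)]) -> String {
    cookies
        .iter()
        .map(|(name, value)| format!("{name}={value}"))
        .collect::<Vec<_>>()
        .join("; ")
}

fn main() {
    // Placeholder cookies; real names and values depend entirely on the site.
    let cookies = [("session_id", "abc123"), ("csrf_token", "xyz789")];

    // Headers a browser-imitating request might carry. The User-Agent below
    // is only an example of a browser-style string.
    let headers = [
        ("Cookie", cookie_header(&cookies)),
        (
            "User-Agent",
            "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0".to_string(),
        ),
    ];

    for (name, value) in &headers {
        println!("{name}: {value}");
    }
}
```

Session cookies expire, so a bot built this way usually needs to repeat the login request now and then to get fresh ones.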

For crawlers in particular, make sure not to spam a site with several requests a second. That's how you get your IP address blocked.
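One simple way to stay polite is to enforce a minimum gap between requests. The sketch below uses only the standard library; the `Throttle` type, the two-second gap, and the example URLs are my own placeholders, not something any particular site asks for. Check the site's robots.txt and terms for what it actually expects.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper that enforces a minimum gap between consecutive requests.
struct Throttle {
    min_gap: Duration,
    last: Option<Instant>,
}

impl Throttle {
    fn new(min_gap: Duration) -> Self {
        Throttle { min_gap, last: None }
    }

    /// Blocks until at least `min_gap` has passed since the previous call.
    /// The first call returns immediately.
    fn wait(&mut self) {
        if let Some(last) = self.last {
            let elapsed = last.elapsed();
            if elapsed < self.min_gap {
                thread::sleep(self.min_gap - elapsed);
            }
        }
        self.last = Some(Instant::now());
    }
}

fn main() {
    // Assumed gap of two seconds; pick something appropriate for the site.
    let mut throttle = Throttle::new(Duration::from_secs(2));
    for url in ["https://example.com/a", "https://example.com/b"] {
        throttle.wait();
        // A real crawler would issue the HTTP request here.
        println!("would fetch {url} now");
    }
}
```

Calling `wait()` right before every request keeps the rate bounded no matter how quickly the rest of the loop runs.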

Interesting, mate.

What if I wanted to do it completely from scratch, using no crates at all? Would this be possible, and how difficult would it be?

If this were 10 years ago, I'd have said sure, use a UdpSocket to make DNS requests and a TcpStream to make HTTP requests, implement the relevant protocols yourself to whatever extent necessary, and write your own brittle HTML parser. But nowadays, everything's moved to being HTTPS-only, which means that you'd need to write your own TLS stack. For reference, rustls has 18.4 kSLOC, and it depends on ring with 12.3 kSLOC and webpki with 2.2 kSLOC. Reimplementing TLS is within the realm of possibility, but IMHO, it's far outside the scope of an individual project.
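For a sense of what the plain-HTTP part looks like without any crates, here is a minimal sketch over a TcpStream. It is deliberately incomplete: it only speaks unencrypted HTTP on port 80, it lets the standard library resolve the host name instead of making its own DNS requests, and it does not handle redirects or chunked response bodies. The function names are my own, and example.com is just a stand-in host.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Builds a minimal HTTP/1.1 GET request. `Connection: close` asks the
/// server to close the socket afterwards, so we can simply read until EOF.
fn build_get_request(host: &str, path: &str) -> String {
    format!("GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n")
}

/// Extracts the numeric status code from a status line such as `HTTP/1.1 200 OK`.
fn parse_status(response: &str) -> Option<u16> {
    response.lines().next()?.split_whitespace().nth(1)?.parse().ok()
}

/// Returns everything after the blank line that separates headers from the body.
fn split_body(response: &str) -> Option<&str> {
    response.split_once("\r\n\r\n").map(|(_, body)| body)
}

/// Sends the request over plain TCP and returns the raw response text.
fn fetch(host: &str, path: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect((host, 80))?;
    stream.write_all(build_get_request(host, path).as_bytes())?;
    let mut raw = Vec::new();
    stream.read_to_end(&mut raw)?;
    Ok(String::from_utf8_lossy(&raw).into_owned())
}

fn main() {
    match fetch("example.com", "/") {
        Ok(response) => {
            println!("status: {:?}", parse_status(&response));
            let body_len = split_body(&response).map_or(0, |body| body.len());
            println!("body length: {body_len} bytes");
        }
        Err(err) => eprintln!("request failed: {err}"),
    }
}
```

Many sites answer a plain-HTTP request like this with nothing more than a redirect to HTTPS, which is exactly where the TLS problem starts.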
