Hey guys, I'm looking for a crate that can crawl a specific site, including password-protected ones such as Gmail. The idea is that I supply my username and password so it can go through my Gmail, and I can program it to do whatever I want, such as sending email.
Is there a crate like this?
I believe there are some basic utilities for crawling, but I strongly doubt there's anything that knows how to log in to specific services.
Also, for Gmail and other Google products, note that Google takes specific measures to prevent bots from navigating their interfaces. Such a bot could stop working at any time. Have you considered using the Gmail API?
I could, but what about Outlook? Does Microsoft block that kind of bot activity too?
For Outlook, you could use the Outlook REST API.
Ah, right. What if I wanted to make my own bot/web crawler? How would I get started?
Really, there are a few different steps for a bot in the general case.
- Determine which API calls the web interface makes to the server.
- Determine how the API gets its authentication data. Usually this is from a cookie stored when the user logs in.
- Use your favorite HTTP library (e.g., `reqwest`) to repeat those API calls. You may need to set the `User-Agent` to imitate a browser. This may be against the Terms of Service of the website.
- If you need to actually scrape data from a web interface, analyze where in the HTML tree the data is stored. Then, use an HTML-parsing crate such as `kuchiki` to extract that from the response text.
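To make the second and third steps a bit more concrete, here is a minimal, std-only sketch of the headers a replayed API call would usually carry. The function name `replay_headers`, the cookie name `session_id`, and the header values are all made up for illustration; a real site will use its own cookie names, and in practice you would hand these pairs to your HTTP library rather than build them yourself.

```rust
/// Builds the header pairs for a replayed API call.
/// `cookies` holds the name/value pairs copied from a logged-in browser session.
fn replay_headers(cookies: &[(&str, &str)], user_agent: &str) -> Vec<(String, String)> {
    // All cookies go into one header: "name=value; name2=value2".
    let cookie_value = cookies
        .iter()
        .map(|(name, value)| format!("{name}={value}"))
        .collect::<Vec<_>>()
        .join("; ");

    vec![
        ("Cookie".to_string(), cookie_value),
        ("User-Agent".to_string(), user_agent.to_string()),
    ]
}

fn main() {
    // Hypothetical values, not taken from any real service.
    let cookies = [("session_id", "abc123"), ("theme", "dark")];
    for (name, value) in replay_headers(&cookies, "Mozilla/5.0 (example)") {
        println!("{name}: {value}");
    }
}
```

Keeping the header construction in one small function makes it easy to swap in fresh cookies when the session expires.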
For crawlers in particular, make sure not to spam a site with several requests a second. That's how you get your IP address blocked.
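One simple way to avoid hammering a site is to enforce a minimum gap between requests. The following is only a sketch using the standard library; the `Throttle` type, the one-second gap, and the URLs are my own assumptions, and a site's `robots.txt` or documentation may call for something slower.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Enforces a minimum gap between consecutive requests.
struct Throttle {
    min_gap: Duration,
    last: Option<Instant>,
}

impl Throttle {
    fn new(min_gap: Duration) -> Self {
        Throttle { min_gap, last: None }
    }

    /// Blocks until at least `min_gap` has passed since the previous call.
    /// The first call returns immediately.
    fn wait(&mut self) {
        if let Some(last) = self.last {
            let elapsed = last.elapsed();
            if elapsed < self.min_gap {
                sleep(self.min_gap - elapsed);
            }
        }
        self.last = Some(Instant::now());
    }
}

fn main() {
    // Hypothetical URLs; the one-second gap is an arbitrary, conservative choice.
    let urls = ["https://example.com/a", "https://example.com/b"];
    let mut throttle = Throttle::new(Duration::from_secs(1));
    for url in urls {
        throttle.wait();
        // A real crawler would issue the HTTP request here.
        println!("fetching {url}");
    }
}
```

Calling `throttle.wait()` right before every request keeps the rate limit in one place, no matter how many code paths issue requests.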
What if I wanted to do it completely from scratch, using no crates at all? Would that be possible, and how difficult would it be?
If this were 10 years ago, I'd have said sure: use a `UdpSocket` to make DNS requests and a `TcpStream` to make HTTP requests, implement the relevant protocols yourself to whatever extent necessary, and write your own brittle HTML parser. But nowadays, everything's moved to being HTTPS-only, which means that you'd need to write your own TLS stack. For reference, `rustls` has 18.4 kSLOC, and it depends on `ring` with 12.3 kSLOC and `webpki` with 2.2 kSLOC. Reimplementing TLS is within the realm of possibility, but IMHO, it's far outside the scope of an individual project.
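For a sense of what the "easy" plain-HTTP part looks like with only the standard library, here is a rough sketch. It assumes the server still answers unencrypted HTTP on port 80, and it leans on the OS resolver (via `TcpStream::connect`) instead of hand-written DNS over `UdpSocket`. The helper names are mine, and `example.com` is just a placeholder host.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Builds a minimal HTTP/1.1 GET request by hand.
fn build_get_request(host: &str, path: &str) -> String {
    format!("GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n")
}

/// Extracts the status code from the first line of a raw response,
/// e.g. "HTTP/1.1 200 OK" gives Some(200).
fn parse_status_code(response: &str) -> Option<u16> {
    response.lines().next()?.split_whitespace().nth(1)?.parse().ok()
}

/// Sends the request over plain TCP (port 80, no TLS) and returns the raw response.
fn fetch_plain_http(host: &str, path: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect((host, 80))?;
    stream.write_all(build_get_request(host, path).as_bytes())?;
    // "Connection: close" asks the server to close the socket when done,
    // so reading to end-of-stream collects the whole response.
    let mut raw = Vec::new();
    stream.read_to_end(&mut raw)?;
    Ok(String::from_utf8_lossy(&raw).into_owned())
}

fn main() {
    match fetch_plain_http("example.com", "/") {
        Ok(resp) => println!("status: {:?}", parse_status_code(&resp)),
        Err(e) => eprintln!("request failed: {e}"),
    }
}
```

This skips redirects, chunked encoding, and of course TLS, which is exactly the part that makes a true from-scratch crawler impractical today.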