Hi there,
Those who know me know that I run Bongoza, the currently close to non-existent search engine. Well, that will change in a few months once I get my server up and running.
Anyway… For the crawler you need:
- Snoopy agent (a PHP class that fetches the web page, extracts the text, pulls out the links, etc.)
- Web server with PHP
- Some time (and maybe more to tweak Snoopy)
Crawling Process:
- Fetch the main page
- Extract the links found on the page and put them in an array of links to be fetched
- Parse the page and extract data (text, meta tags, size, etc.)
- Index the page
- Start fetching pages from the array of links to be fetched, and repeat the steps above for each one
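The steps above can be sketched roughly as follows. This is a minimal Python illustration of the same loop, not the Snoopy/PHP code itself: the names `PageParser`, `fetch`, `crawl` and the `max_pages` limit are made up for the example, and a plain dictionary stands in for the real index.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects links, meta tags and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links, self.meta, self.text = [], {}, []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only text that is not inside <script> or <style>.
        if not self._skip and data.strip():
            self.text.append(data.strip())


def fetch(url):
    """Download a page and return its HTML as text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Crawl outward from start_url; returns {url: record} as a stand-in index."""
    queue, seen, index = deque([start_url]), {start_url}, {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        parser = PageParser()
        parser.feed(html)
        # "Index" the page: here that just means storing the extracted data.
        index[url] = {
            "text": " ".join(parser.text),
            "meta": parser.meta,
            "size": len(html),  # size in characters, not bytes
        }
        # Queue every new http(s) link, resolved against the current page.
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

Calling `crawl("http://example.com/")` would fetch that page first and then work through the queued links. The `seen` set is there so that pages linking back to each other are not fetched twice, and `max_pages` keeps the loop from running forever on a large site.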
Well, that's pretty much the idea. I will post the code once I get some time to finish it completely, with a lot more features like robots.txt support and anything else I can think of.
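As a rough illustration of what the robots.txt check might look like, here is a small Python sketch using the standard library's `urllib.robotparser`. It is only an assumption about how such a check could work, not the planned PHP code; the helper name `allowed` and the user agent string `BongozaBot` are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules already downloaded as text
    return parser.can_fetch(user_agent, url)


# Hypothetical rules and user agent, for illustration only.
rules = "User-agent: *\nDisallow: /private/\n"
print(allowed(rules, "BongozaBot", "http://example.com/index.html"))
print(allowed(rules, "BongozaBot", "http://example.com/private/page.html"))
```

A crawler would run a check like this on each link before adding it to the list of pages to be fetched.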







