The Deep Web, also called the Deep Net, the hidden web, the Undernet or the invisible web, is the part of the World Wide Web whose content is not indexed by standard search engines.
The Deep Web should not be confused with the “dark internet”; the two are separate concepts. The dark internet refers to network hosts or regions of the internet that are unreachable or no longer accessible over the network, sometimes for confidentiality reasons, whereas the Deep Web refers to websites and web content that are hidden from, or not indexed by, standard search engines.
In short, the Deep Web covers websites and web content that are inaccessible or hard to find through standard search engines. Searching with a standard search engine is like looking at the surface of the ocean: far more information lies beneath it. The corresponding hidden area of the internet is known as the Deep Web.
Deep Web Resources
Deep Web resources can be classified into several categories:
- Dynamic Content: Some websites generate their pages dynamically. The content of those pages is produced in response to a request (for example, a form submission), so it changes with each request and has no fixed address that other pages can link to.
- Unlinked Content: Some pages of a website are not linked from any other page that is visible to the user. Such pages have no backlinks from other websites or from any other source, so a crawler that follows links never reaches them.
- Private Web: Some websites do not let a user access their content without registering or signing in. This kind of protection is known as password-protected content.
- Contextual Web: Content whose availability depends on the access context. For example, some websites block access from outside their own IP ranges, so if the client's IP address changes, accessibility changes too.
- Limited Access Content: Some websites use CAPTCHAs, robot-exclusion rules and other mechanisms to keep their files and content from being accessed automatically. This also prevents search engine crawlers from creating cached copies of that content.
- Scripted Content: Some websites rely on scripts to reach important content. They use JavaScript, AJAX and related scripting technologies to load or reveal that content, which a crawler that does not run scripts cannot see.
- Non-HTML/Text Content: Content in formats that search engine crawlers cannot read as text, such as images or videos. Such content has to be described through HTML tags before crawlers can make sense of it.
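To make the unlinked content category concrete, here is a minimal sketch of why a link-following crawler misses such pages. The site structure in `SITE_LINKS` and the page paths are made up for illustration, and real crawlers are far more involved.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/1"],
    "/products/1": [],
    "/internal-report": [],  # exists on the server, but no page links to it
}

def crawl(start, links):
    """Follow links breadth-first and return every page reached from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

reachable = crawl("/", SITE_LINKS)
unreached = sorted(set(SITE_LINKS) - reachable)
print(unreached)  # pages that stay hidden from the crawler
```

In this toy example the only page left out is `/internal-report`, because no link leads to it.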
Deep Web Crawling
Crawling a deep web page is generally hard, but over time several algorithms and programs have been created to do it.
First, researchers proposed an architectural model for deep web search that used key phrases provided by users, or automatically generated queries, to probe deep web sources.
Second, other researchers developed a hidden-web crawler that locates the search forms on websites and submits queries to them, so that it can retrieve results from the data servers behind those forms.
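The form-based approach can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the method of any particular crawler: it looks at the first GET form on a page, takes its first text field, and builds one query URL per keyword. `FormFinder`, `build_query_urls`, the sample HTML and the example.org address are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

class FormFinder(HTMLParser):
    """Records the action and text-input names of the first GET form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = []
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        method = (attrs.get("method") or "get").lower()
        if tag == "form" and not self._done and method == "get":
            self._in_form = True
            self.action = attrs.get("action") or ""
        elif tag == "input" and self._in_form:
            # Only free-text fields are useful for keyword queries.
            if (attrs.get("type") or "text") == "text" and attrs.get("name"):
                self.fields.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True

def build_query_urls(page_url, html, keywords):
    """Return one query URL per keyword for the first text field of the form."""
    finder = FormFinder()
    finder.feed(html)
    if finder.action is None or not finder.fields:
        return []
    target = urljoin(page_url, finder.action)
    field = finder.fields[0]
    return [f"{target}?{urlencode({field: kw})}" for kw in keywords]

SAMPLE_HTML = """
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="submit" value="Go">
</form>
"""

urls = build_query_urls("https://example.org/catalog", SAMPLE_HTML,
                        ["deep web", "crawler"])
for url in urls:
    print(url)
```

A real hidden-web crawler would then fetch each generated URL and extract the result pages; choosing good keywords is the hard part and is left out here.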
After that, developers made another notable effort in the form of DeepPeep, which collected deep web sources from many different domains so that they could be queried.
Beyond these, standard search engines such as Google have also worked on techniques for reaching deep web resources. Two mechanisms that are now well known and widely used are mod_oai and sitemaps.
Through sitemaps, developers can list their unlinked and hidden web pages in an XML file that search engine crawlers read to discover those pages.
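As a minimal sketch of what such a file contains, the following builds a sitemap with Python's standard library. The page addresses are hypothetical; the `urlset`, `url` and `loc` elements and the namespace come from the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return sitemap XML text listing the given page addresses."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages that no other page links to.
pages = [
    "https://example.org/internal-report",
    "https://example.org/archive/2010/summary",
]

xml_text = build_sitemap(pages)
print(xml_text)
```

The resulting text would typically be saved as a UTF-8 file such as `sitemap.xml` at the site root and submitted to the search engine.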
Conclusion
The Deep Web is the portion of the web that standard search engines do not crawl, either because nothing links to it or because developers deliberately keep it away from unauthorized users. Deep web resources can be grouped into seven categories: dynamic content, unlinked content, private content, contextual content, limited access content, scripted content and non-HTML/text content. One way to get deep web pages crawled is to add their addresses to an XML file known as a sitemap.