java.lang.Object
   |
   +----java.util.Vector
           |
           +----BRobotsTxt
This class uses a loose implementation of the Robots Exclusion Standard
as it is commonly used throughout the web. There is an Internet-Draft
for a more rigid version. Since that is not yet a standard, and the author
believes that the existing convention offers a good opportunity to protect
websites from being indexed against their will, the following implementation
fits the requirements ;-)
For more information on this topic visit
http://info.webcrawler.com/mak/projects/robots/.
The program works as follows:
- For each URL, look for the file robots.txt
- Parse the file and check whether the URL is forbidden for robots
If required, an important part of the code could follow here (one that explains something; this is just an example):

    private String[] init(URL url, String robotName) {
        Vector v = getRobotFile(url, robotName.toLowerCase());
        String[] stringArray = new String[v.size()];
        v.copyInto(stringArray);
        return stringArray;
    }
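The two steps above can be sketched roughly as follows. This is a minimal, standalone illustration of the common robots.txt convention, not the actual BRobotsTxt code; the class and method names (RobotsCheck, isAllowed for a path string) are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: parse a robots.txt body and test a path against the
// Disallow prefixes recorded for a given robot name (or "*").
class RobotsCheck {
    private final List<String> disallowed = new ArrayList<>();

    RobotsCheck(String robotsTxt, String robotName) {
        boolean applies = false;
        for (String line : robotsTxt.split("\n")) {
            // Strip comments and surrounding whitespace.
            int hash = line.indexOf('#');
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim();
            if (line.isEmpty()) continue;

            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                // A record applies to us if it names our robot or "*".
                applies = value.equals("*")
                        || value.toLowerCase().contains(robotName.toLowerCase());
            } else if (field.equals("disallow") && applies && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
    }

    // A path is allowed unless it starts with a recorded Disallow prefix.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

Note that, per the convention, an empty Disallow value means "nothing is forbidden", which is why empty values are skipped when collecting prefixes.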
public BRobotsTxt(URL url, String robotName) throws NoRobotInformationException
public boolean isAllowed(URL url)