Welcome to the Issue Crawler, the network
mapping software by the Govcom.org Foundation,
Amsterdam. This is the online documentation.
(Auto-request an account at issuecrawler.net.)
Issuecrawler.net has a FAQ
and Scenarios of Use. There is also the Issue
Crawler back-end movie (10 min.).
Windows / Internet Explorer Users: Download
the svg viewer plug-in at http://www.adobe.com/svg.
No plug-in is necessary for Firefox, Safari or Chrome users.
Enter at least two related URLs into the Issue Crawler, harvest them, name your crawl and launch it. Crawls complete in 10 minutes to 8 hours, depending upon the quantity of starting points. View the map in the Network Manager. Clicking node names opens URLs. Save from the map options. Print the map from a saved file, such as a pdf. (For printing from pdf, page set-up should be landscape, and use 'actual size,' not 'fit to page.')
The IssueCrawler
is web network location and visualization software.
It consists of crawlers, analysis
engines and visualisation modules. It
is server-side software that crawls specified
sites and captures the outlinks from the specified
sites. Sites may be crawled and analyzed in three ways: co-link, snowball and inter-actor. Co-link analysis crawls the seed URLs and retains the pages that receive at least two links from the seeds. Snowball analysis crawls sites and retains pages receiving at least one link from the seeds. Inter-actor analysis crawls the seed URLs and retains inter-linking between the seeds. The Issue Crawler
visualises the results in circle, cluster
and geographical maps. For user tips,
see also scenarios of use, available at http://www.govcom.org/scenarios_use.htm.
For a list of articles resulting from the
use of the Issue Crawler, see http://www.govcom.org/publications.html. Query scholar.google.com for issuecrawler or "issue crawler".
The following is a step-by-step guide to using the software.
Enter Username
and Password
Remember me? Checking the
box has the software remember your username
and password for future use. (A cookie is
used.) Your browser also is able to remember
your log-in details.
Forgot password? Type your username or email address into the username field and press login. A new password is sent to your email address if you are a valid user.
Request account? Fill in as many fields as you feel comfortable with. Note how a user's privacy concerns have been built into the archive search, whilst still enabling an open archive.
That is, one cannot search the archive for a user's name.
The Lobby is so named because it is the area where one waits for crawls to complete.
Crawl completion time varies
between 10 minutes and days, depending on
the number of servers from which the crawler
requests pages and the quantity of crawls
in the queue.
Whilst waiting users may read news
about the software. (News is posted by the
administrators of the software.) Users also
may view maps in the archive
as well as launch additional crawls.
To the right is the listing of current
and queued crawls. Crawls are either
crawling or queued (i.e., ‘waiting to
be launched’). Crawls run sequentially
on parallel crawlers. Under details
you may view the author, email address, settings
and progress of the current crawls, as well
as live views of the crawls.
Estimated completion time may change significantly
depending on current server load.
The User Manager is below
the listing of current crawls. Users may change
their username, password and email address.
The Issue Crawler is the
crawler itself. There are two steps
before launching a crawl.
4.1
The Harvester. (Step one)
The Harvester is so named because it strips URLs from text dumped into the space.
For example, one may copy and paste a page
of search engine returns into the Harvester.
The Harvester strips away the text, leaving
only URLs. It is a generally useful tool in
itself. (See also FAQ.)
Type or paste at least two different
URLs into the harvester, and press
harvest. These harvested URLs will be crawled
upon launching the crawl.
Tip:
If you find a relevant link list on the Web without visible URLs, view page
source, copy the code containing the URLs,
paste into the Harvester and press Harvest.
The Harvester will strip out the code leaving
only URLs.
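For those curious what this amounts to in practice, below is a minimal Python sketch of Harvester-style URL extraction. It is illustrative only: the regular expression and function name are assumptions, not the Harvester's actual implementation.

import re

# Illustrative pattern: http(s) URLs and bare www. hosts found in pasted text or page source.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>)]+|www\.[^\s"\'<>)]+')

def harvest_urls(text):
    """Strip away everything except URLs, keeping each URL once, in order of appearance."""
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip('.,;')   # trim trailing punctuation picked up from the prose
        if url not in urls:         # drop double entries
            urls.append(url)
    return urls

sample = 'See <a href="http://www.freetibet.org/info/links.html">links</a> and www.govcom.org.'
print(harvest_urls(sample))
# ['http://www.freetibet.org/info/links.html', 'www.govcom.org']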
4.2
The Crawler Settings. (Step two)
Your harvested URLs appear in the
box. You may edit and remove URLs. You may
save your harvested results. This is also
the stage where you provide the Crawler with
instructions (the crawler settings), and where
you name and launch your crawl.
Tips:
Once you have harvested:
Remove double entries by clicking
on a URL, and pressing remove.
View starting points to ensure they
are correct by clicking on a URL, and pressing
view.
Should the URL be incorrect, edit
the starting point by clicking the URL and
pressing edit. Once edited, press
update.
You may save your harvested results
by pressing save results.
A text file is created.
Should you wish to add URLs,
save your results, return to the Harvester,
and paste your saved results into the Harvester.
Add URLs. Press Harvest.
4.3 Explanation
of General Crawler Operation.
Co-link Analysis. The Issue Crawler crawls the specified starting
points, captures the starting points’
outlinks, and performs co-link analysis to
determine which outlinks at least two starting
points have in common. The Issue Crawler performs
these two steps (crawling and co-link analysis)
once, twice or three times. Each performance
of these two steps is called an iteration.
Each iteration has the same crawl depth. The
crawler respects robot exclusion files. Note:
if you desire to see a site's robots exclusion policy,
you may wish to consult http://tools.issuecrawler.net/robots/.
Snowball Analysis. The Issue Crawler crawls the specified starting points, captures the starting points' outlinks and retains them. This is one degree of separation. Subsequently capturing the outlinks of the retained URLs is the second degree of separation. By default the Issue Crawler snowball analysis captures two degrees of separation, and up to three in total.
Inter-actor Analysis. The Issue Crawler crawls the specified starting points, captures the starting points' outlinks and shows inter-linking between the starting points only. It also includes isolates, i.e., those starting points that have received no links from other starting points.
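To make the three modes concrete, here is a small Python sketch. It assumes the crawl results are available as a dictionary mapping each starting point (seed) to the list of outlinks captured from it; the function names and data layout are illustrative, not the Issue Crawler's internals.

def colink_analysis(outlinks, min_seeds=2):
    """Retain pages that receive links from at least `min_seeds` distinct starting points."""
    counts = {}
    for seed, links in outlinks.items():
        for url in set(links):
            counts[url] = counts.get(url, 0) + 1
    return {url for url, n in counts.items() if n >= min_seeds}

def snowball_analysis(outlinks):
    """Retain every page that receives at least one link from a starting point."""
    return {url for links in outlinks.values() for url in links}

def interactor_analysis(outlinks):
    """Retain only links pointing from one starting point to another; isolates keep empty sets."""
    seeds = set(outlinks)
    return {seed: set(links) & seeds for seed, links in outlinks.items()}

outlinks = {
    'a.org': ['c.org', 'd.org', 'b.org'],
    'b.org': ['c.org', 'e.org'],
}
print(colink_analysis(outlinks))      # {'c.org'} - the only page linked by two seeds
print(interactor_analysis(outlinks))  # a.org links to b.org; b.org links to no other seed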
Tip:
1. Avoid crawling big media sites, search engines, pdf files, image files and, more generally, pages without specific outgoing links.
More specific crawler operation information
is available in the FAQ.
4.4 Crawler Settings in Detail (Co-link analysis module)
Co-link Analysis. There are 4 settings. The default
settings suffice to ensure a crawl.
First-time users are encouraged to use one
iteration of method. You must name
your crawl before launching the crawler.
Privilege Starting Points:
This setting keeps your starting points in
the results after the first iteration. Privileging
starting points (and using one iteration of
method) are suggested for social network mapping.
The software understands a social network
as the starting points plus those organizations
receiving at least two links from the starting
points.
Perform co-link analysis by page or
by site. Performing co-link analysis
by page analyses deep pages, and returns networks
consisting of pages. Performing co-link analysis
by site returns networks consisting of sites
or homepages only. Analysis by page is suggested,
for the results are more specific, and the
clickable nodes on the map are often 'deep
pages' as opposed to homepages. The page on
the site receiving the most inlinks is the
clickable page.
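The practical difference between the two modes can be pictured as a question of how a URL is reduced before counting: by site, everything collapses to the host; by page, the deep-page path is kept. A hedged Python sketch, with an illustrative helper name:

from urllib.parse import urlparse

def normalise(url, by='page'):
    """Reduce a URL to the unit of analysis: the full deep page, or just the host (site)."""
    parsed = urlparse(url)
    if by == 'site':
        return parsed.netloc                  # e.g. 'www.greenpeace.org'
    return parsed.netloc + parsed.path        # e.g. 'www.greenpeace.org/campaigns/climate'

url = 'http://www.greenpeace.org/campaigns/climate'
print(normalise(url, by='site'))   # analysis by site: homepage-level node
print(normalise(url, by='page'))   # analysis by page: deep-page node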
Set iterations. One may
set the number of iterations of method (crawling
and co-link analysis) to one, two or three
iterations. One iteration is suggested for
social network mapping, two for issue network
mapping and three for establishment network
mapping. For a longer description of the distinction
between networks, see also scenarios of use,
http://www.govcom.org/scenarios_use.htm.
Crawl depth. One may crawl
sites one, two or three layers deep.
Here is a strict definition of how
depth is calculated.
The pages fetched from the starting point
URLs are considered to be depth 0. The pages fetched from URL links
from those pages are considered to be depth
1. In general, the pages found from URL links
on a page of depth N are considered to be
depth N+1. If you set a depth of 2, then no
pages of depth 2 will be fetched. Only pages
of depth 0 and 1 will be fetched (i.e., two
levels of depth). {Text by David Heath at
Oneworld.}
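As an illustration of that depth rule, the sketch below runs a breadth-first crawl in Python: pages fetched from the starting points are depth 0, their outlinks depth 1, and a depth setting of N stops fetching before depth N. The fetch_outlinks argument is a placeholder for an actual page fetch, not an Issue Crawler function.

def crawl(start_urls, depth_setting, fetch_outlinks):
    """Breadth-first crawl that fetches pages at depths 0 .. depth_setting-1 only."""
    fetched = set()
    links = {}                                  # page -> outlinks captured from it
    frontier = list(dict.fromkeys(start_urls))  # depth 0
    for _ in range(depth_setting):              # one pass per fetched depth level
        next_frontier = []
        for url in frontier:
            if url in fetched:
                continue
            fetched.add(url)
            links[url] = fetch_outlinks(url)    # placeholder for the real page fetch
            next_frontier.extend(links[url])
        frontier = next_frontier                # these pages sit one level deeper
    return links  # outlinks of the deepest fetched level are captured but not followed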
Tips:
1. Use links pages as starting
points. Links pages are the URLs where hyperlinks
are listed, e.g., http://www.freetibet.org/info/links.html.
Occasionally sites, using frames or other
structures, are so designed that visitors
may have the impression that they are always
on the homepage. If, on the homepage, you
notice a hyperlink to ‘links’
or ‘resources’, right-mouse click
the ‘links’, copy location to
clipboard, and paste into the harvester. Use
as many links pages as possible for your starting
points.
2. Give the crawler the least amount
of work to do. Using a few links
pages as starting points, with one iteration
of method and one layer deep will provide
the quickest crawl completion.
3. Before launching a crawl, name
the crawl clearly. Name the crawl
so that others viewing the archive will understand
what it is. Viewing the archive will provide
you with an understanding of crawls that have
been named well or less so.
Ceilings (advanced). The crawled
URL ceiling (per host) is the maximum quantity
of URLs crawled on each host. The crawled URL
ceiling (overall) is the total quantity of URLs
crawled (max 60000). The co-link ceiling by
page (pages per host per iteration) is the maximum
quantity of co-linked pages returned per iteration
(max 1000). The co-link ceiling by site (hosts
per iteration) is the maximum quantity of co-linked
sites returned per iteration (max 1000).
Crawl speed.
You may modify the crawl speed. For example,
if you are not in a hurry, set the crawl speed
to low.
Exclusion list.
There is a list of URLs to be excluded from
crawling and thereby excluded from the results,
e.g., software download pages, site stats
counters, search engines and others. It is
suggested that you keep your own list. You
may edit the existing list. Please note the
list format, and edit the list using the same
format, i.e., www.google.com ; news.google.com.
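A sketch of how a list in that format might be parsed and applied, assuming entries are hostnames separated by semicolons as in the example above; the helper names are illustrative, not the Issue Crawler's own.

from urllib.parse import urlparse

def parse_exclusions(raw):
    """Parse an exclusion list written in the documented 'host ; host' format."""
    return {entry.strip().lower() for entry in raw.split(';') if entry.strip()}

def is_excluded(url, exclusions):
    """True if the URL's host is an excluded host or a subdomain of one."""
    host = urlparse(url).netloc.lower()
    return any(host == ex or host.endswith('.' + ex) for ex in exclusions)

exclusions = parse_exclusions('www.google.com ; news.google.com')
print(is_excluded('http://news.google.com/world', exclusions))  # True
print(is_excluded('http://www.greenpeace.org/', exclusions))    # False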
Name and
Launch crawl.
Name crawl before launch. Use a name that
clearly identifies the network you seek. Once
you have launched a crawl, your crawl details
will appear. These include the name of your
crawl, and the time and date launched.
The principal purpose of the Network Manager
as well as the Archive is to allow you to
generate, view, edit,
save and print maps. You also may
view and download issuecrawler data outputs,
such as a ranked actor list for each network,
and an actor interlinking matrix.
The Network Manager provides a list of your
completed crawls. The Archive provides a list
of all users’ completed crawls.
The archive may be searched.
Features of the Network Manager and Archive
The Network Manager and the Archive have a
number of features.
List of completed crawls. Listed
are the network names and top five organizations
in each network. Each network lists the top
5 URLs beneath the title of the network, with
an inlink count in parentheses. The inlink
count is the total number of links the organization
or site has received from the network. It
is a page count. Clicking on an organization
(in the form of a shortened URL) places it
in the archive search, and allows you to find
all maps in the archive containing that organization
(according to the homepage URL, without
the www, such as greenpeace.org). It seems
that worldbank.org currently appears in the
most networks in the archive.
Network Selection - The Scheduler.
You may schedule the network to repeat the
crawl at specified intervals using either
your original starting points or the network
results. This allows you to watch
the evolution of the network over time,
either on your terms (scheduling a crawl using
your starting points) or on the network’s
terms (scheduling a crawl using last available
network results).
Network Selection – View Map.
You may view a depiction of your network as
a circle or cluster map. You also may send
your network results to the issuegeographer
and make a geographical map. It plots site registration locations on a geographical map, using whois.net data.
Network Selection – Edit Map
Name and Add Legend Text. You may
change the name of the map and add a legend
text by pressing the + sign below, editing
and pressing save changes. The legend text
will appear on the map.
Network Selection – Other Data
Outputs. Available are: the xml source
file; the raw data (comma separated); ranked
network actors lists; an actor list with interlinkings
(core network) and its equivalent non-matrix
version; actor list with interlinkings (core
network and periphery) and its equivalent
non-matrix version; and the page list with
their interlinkings (core and periphery).
Each is detailed in 5.5.
Map Viewing
Pressing View Depiction for a cluster map
or a circle map generates a map. The map is
generated as a scalable vector graphic (svg).
The Internet Explorer browser requires a plug-in
to view an svg file. An svg viewer plug-in
is available at http://www.adobe.com/svg.
Other W3C standards-compliant browsers with native svg support built in do not require the plug-in.
The map shows its name, author, crawl
start and completion dates, as well
as the crawler settings. It also loads statistics
of the largest node on the map, by default.
The largest node is the node that has received
the most inlinks from the network actors.
Legend text may be added
on the network details page.
The legend shows the top-
and second-level domains ("node types")
represented on the map.
For the cluster
map, the placement of the nodes on
the map is significant. Placement
is relative to significance of the node to
other nodes, according to the ReseauLu
approach.
Map Interactivity
Clickable Node Names. Each
node name on the map is clickable. Clicking
a node name will open a pop-up window and
retrieve the URL associated with the node
name. Should you have run your crawl with
the co-link analysis mode set to ‘by
page’, often the nodes are ‘deep
pages’.
Clickable Nodes
Selecting a node shows the destination URL,
the node’s crawl inlink count, as well
as its links to and from other network actors,
in the statistics.
Clickable Node Types (domains
and sub-domains)
You may turn on and off links to and from
domains and sub-domains listed in the legend.
You also may turn on and off links, using
the drop-down menu.
Zooming and Panning. To zoom
in, out and return to the original view, use ctrl-mouse.
To pan, press alt and drag.
Saving Map.
Use the save and export option on the map.
Save the interactive .svg file
for uploading to a site or for file transfer.
In order for the .svg file to load on your
site, put a line in the mime-types configuration
for your webserver that recognizes svg and
outputs the correct content type to the web
browser. This is standard with Apache.
Save the .jpg or .png file as a flat image for pasting into a document or into html. Save the .tiff flat image for higher print quality. Save the .pdf file as a document.
Printing
Map.
Print from the imported or saved file. Landscape orientation is advised. Printing from the browser also works but is not optimal. To print the .pdf, set the document to "actual size."
xml source file.
This is the file used by the software to generate
a map. It contains an [info] section at the
top of the page, which shows crawl errors,
should they have occurred. At the bottom of
the file is the original starting point list.
raw data (comma separated). This
is useful for importing data into other analytical
software.
ranked actor list by inlink count
from network (by page). A
ranked actor list (by page count) is outputted
from network results. It is useful for comparative
analysis over time, in combination with the
scheduler.
ranked actor list by inlink count
from network (by site). A
ranked actor list (by site count) is outputted
from network results. By site is suggested
for comparative network actor analysis over
time, for the URLs are truncated, e.g., greenpeace.org.
ranked actor list by inlink count
from crawled population.
actor list with interlinkings
(core network). The page shows how the core
actors interlink. There is both a matrix and
a non-matrix version.
actor list with interlinkings
(core network and periphery). The page shows
how core and peripheral actors interlink.
There is both a matrix and a non-matrix version.
page list with their interlinkings
(core and periphery). The page shows all interlinkings
between all pages.
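To illustrate how the interlinking outputs relate to the ranked actor lists: if the interlinking data is read into a matrix-like structure giving, for each source actor, the number of links it gives to each target actor, then summing each target's column yields an inlink count per actor, which can be ranked. The dictionary layout and the sample figures below are assumptions for illustration, not the exact file format.

# Hypothetical interlinking matrix: source actor -> {target actor: link count}.
matrix = {
    'greenpeace.org': {'worldbank.org': 3, 'foe.org': 1},
    'foe.org':        {'worldbank.org': 2, 'greenpeace.org': 1},
    'worldbank.org':  {},
}

def ranked_actor_list(matrix):
    """Sum inlinks per actor (column totals) and rank actors by that count."""
    inlinks = {actor: 0 for actor in matrix}
    for targets in matrix.values():
        for target, count in targets.items():
            inlinks[target] = inlinks.get(target, 0) + count
    return sorted(inlinks.items(), key=lambda item: item[1], reverse=True)

for actor, count in ranked_actor_list(matrix):
    print(f'{actor} ({count})')   # e.g. worldbank.org (5)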
Map Generation
Retaining the default setting will
generate a map with a node count of approximately
25 or fewer nodes. You may raise or lower the node count. Reducing the node count is equivalent to raising an authority threshold: as the count is lowered, only nodes with higher inlink counts are shown.
Cluster
Map
Map Generation
The cluster map's advanced options provide data about your network.
Choose nodes to be mapped allows you to choose
the number of nodes to be mapped
according to a significance measure, that
is, the ‘top’ nodes
according to inlink count per node.
Selection of ties by specificity
is the qualitative strength
of ties. The network clusters actors with the strongest ties to one another.
Selection of ties by frequency
is the quantitative force of ties. The network
clusters actors with the greatest quantity
of ties between them.
Color scheme by type indicates
domain type, e.g., .gov, .co.uk, .gv.at. Color
scheme by structural position indicates
the type of linking behavior, e.g., only gives links, only receives links, gives and receives links.
Size of nodes by inlinks
indicates that the size of the node is relative
to the number of links received by the site
or organization during the crawl.
Size of nodes by centrality
indicates that the size of the node is relative to the number of links given and received per cluster.