Welcome to the Issue Crawler, the network
mapping software by the Govcom.org Foundation,
Amsterdam. This is the online documentation.
(Auto-request an account at issuecrawler.net.)
Issuecrawler.net has a FAQ
and Scenarios of Use. There is also the Issue
Crawler back-end movie (10 min.).
Windows / Internet Explorer Users: Download
the svg viewer plug-in at http://www.adobe.com/svg.
No plug-in is necessary for Firefox, Safari or Chrome users.
Enter at least two related URLs into the Issue Crawler, harvest them, name your crawl and launch it. Crawls complete in 10 minutes to 8 hours, depending upon the quantity of starting points. View the map in the Network Manager. Clicking node names opens URLs. Save from the map options. Print the map from a saved file, such as a pdf. (For printing from pdf, page set-up should be landscape, and use 'actual size,' not 'fit to page.')
The IssueCrawler
is web network location and visualization software.
It consists of crawlers, analysis
engines and visualisation modules. It
is server-side software that crawls specified
sites and captures the outlinks from the specified
sites. Sites may be crawled and analyzed in three ways: co-link, snowball and inter-actor. Co-link analysis crawls the seed URLs and retains the pages that receive at least two links from the seeds. Snowball analysis crawls sites and retains pages receiving at least one link from the seeds. Inter-actor analysis crawls the seed URLs and retains inter-linking between the seeds. The Issue Crawler
visualises the results in circle, cluster
and geographical maps. For user tips,
see also scenarios of use, available at http://www.govcom.org/scenarios_use.htm.
For a list of articles resulting from the
use of the Issue Crawler, see http://www.govcom.org/publications.html. Query scholar.google.com for issuecrawler or "issue crawler".
The following is a step-by-step guide to using the software.
Enter Username
and Password
Remember me? Checking the
box has the software remember your username
and password for future use. (A cookie is
used.) Your browser also is able to remember
your log-in details.
Forgot password? Type your username or email address into the username field and press login. A new password is sent to your email address if you are a valid user.
Request account? Fill in as many fields as you feel comfortable with. Note how a user's privacy concerns have been built into the archive search, whilst still enabling an open archive.
That is, one cannot search the archive for a user's name.
The Lobby is so named because it is the area where one waits for crawls to complete.
Crawl completion time varies
between 10 minutes and days, depending on
the number of servers from which the crawler
requests pages and the quantity of crawls
in the queue.
Whilst waiting users may read news
about the software. (News is posted by the
administrators of the software.) Users also
may view maps in the archive
as well as launch additional crawls.
To the right is the listing of current
and queued crawls. Crawls are either
crawling or queued (i.e., ‘waiting to
be launched’). Crawls run sequentially
on parallel crawlers. Under details
you may view the author, email address, settings
and progress of the current crawls, as well
as live views of the crawls.
Estimated completion time may change significantly
depending on current server load.
The User Manager is below
the listing of current crawls. Users may change
their username, password and email address.
The Issue Crawler is the
crawler itself. There are two steps
before launching a crawl.
4.1
The Harvester. (Step one)
The Harvester is so named because it strips URLs from text dumped into the space.
For example, one may copy and paste a page
of search engine returns into the Harvester.
The Harvester strips away the text, leaving
only URLs. It is a generally useful tool in
itself. (See also FAQ.)
Type or paste at least two different
URLs into the harvester, and press
harvest. These harvested URLs will be crawled
upon launching the crawl.
Tip:
If you find a relevant link list on the Web without visible URLs, view page
source, copy the code containing the URLs,
paste into the Harvester and press Harvest.
The Harvester will strip out the code leaving
only URLs.
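For those curious what this amounts to in practice, below is a minimal Python sketch of Harvester-style URL extraction. It is illustrative only: the regular expression and function name are assumptions, not the Harvester's actual implementation.

import re

# Illustrative pattern: http(s) URLs and bare www. hosts found in pasted text or page source.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>)]+|www\.[^\s"\'<>)]+')

def harvest_urls(text):
    """Strip away everything except URLs, keeping each URL once, in order of appearance."""
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip('.,;')   # trim trailing punctuation picked up from the prose
        if url not in urls:         # drop double entries
            urls.append(url)
    return urls

sample = 'See <a href="http://www.freetibet.org/info/links.html">links</a> and www.govcom.org.'
print(harvest_urls(sample))
# ['http://www.freetibet.org/info/links.html', 'www.govcom.org']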
4.2
The Crawler Settings. (Step two)
Your harvested URLs appear in the
box. You may edit and remove URLs. You may
save your harvested results. This is also
the stage where you provide the Crawler with
instructions (the crawler settings), and where
you name and launch your crawl.
Tips:
Once you have harvested:
Remove double entries by clicking
on a URL, and pressing remove.
View starting points to ensure they
are correct by clicking on a URL, and pressing
view.
Should the URL be incorrect, edit
the starting point by clicking the URL and
pressing edit. Once edited, press
update.
You may save your harvested results
by pressing save results.
A text file is created.
Should you wish to add URLs,
save your results, return to the Harvester,
and paste your saved results into the Harvester.
Add URLs. Press Harvest.
4.3 Explanation
of General Crawler Operation.
Co-link Analysis. The Issue Crawler crawls the specified starting
points, captures the starting points’
outlinks, and performs co-link analysis to
determine which outlinks at least two starting
points have in common. The Issue Crawler performs
these two steps (crawling and co-link analysis)
once, twice or three times. Each performance
of these two steps is called an iteration.
Each iteration has the same crawl depth. The
crawler respects robot exclusion files. Note:
if you desire to see a site's robots exclusion policy,
you may wish to consult http://tools.issuecrawler.net/robots/.
Snowball Analysis. The Issue Crawler crawls the specified starting points, captures the starting points' outlinks and retains them. This is one degree of separation. Subsequently capturing the outlinks of the retained URLs is the second degree of separation. By default the Issue Crawler snowball analysis captures two degrees of separation, and up to three in total.
Inter-actor Analysis. The Issue Crawler crawls the specified starting points, captures the starting points' outlinks and shows inter-linking between the starting points only. It also includes isolates, i.e., those starting points that have received no links from other starting points.
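To make the three modes concrete, here is a small Python sketch. It assumes the crawl results are available as a dictionary mapping each starting point (seed) to the list of outlinks captured from it; the function names and data layout are illustrative, not the Issue Crawler's internals.

def colink_analysis(outlinks, min_seeds=2):
    """Retain pages that receive links from at least `min_seeds` distinct starting points."""
    counts = {}
    for seed, links in outlinks.items():
        for url in set(links):
            counts[url] = counts.get(url, 0) + 1
    return {url for url, n in counts.items() if n >= min_seeds}

def snowball_analysis(outlinks):
    """Retain every page that receives at least one link from a starting point."""
    return {url for links in outlinks.values() for url in links}

def interactor_analysis(outlinks):
    """Retain only links pointing from one starting point to another; isolates keep empty sets."""
    seeds = set(outlinks)
    return {seed: set(links) & seeds for seed, links in outlinks.items()}

outlinks = {
    'a.org': ['c.org', 'd.org', 'b.org'],
    'b.org': ['c.org', 'e.org'],
}
print(colink_analysis(outlinks))      # {'c.org'} - the only page linked by two seeds
print(interactor_analysis(outlinks))  # a.org links to b.org; b.org links to no other seed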
Tip:
1. Avoid crawling big media sites, search engines, pdf files, image files and, more generally, pages without specific outgoing links.
More specific crawler operation information
is available in the FAQ.
4.4 Crawler Settings in Detail (Co-link analysis module)
Co-link Analysis. There are 4 settings. The default
settings suffice to ensure a crawl.
First-time users are encouraged to use one
iteration of method. You must name
your crawl before launching the crawler.
Privilege Starting Points:
This setting keeps your starting points in
the results after the first iteration. Privileging
starting points (and using one iteration of
method) are suggested for social network mapping.
The software understands a social network
as the starting points plus those organizations
receiving at least two links from the starting
points.
Perform co-link analysis by page or
by site. Performing co-link analysis
by page analyses deep pages, and returns networks
consisting of pages. Performing co-link analysis
by site returns networks consisting of sites
or homepages only. Analysis by page is suggested,
for the results are more specific, and the
clickable nodes on the map are often 'deep
pages' as opposed to homepages. The page on
the site receiving the most inlinks is the
clickable page.
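The practical difference between the two modes can be pictured as a question of how a URL is reduced before counting: by site, everything collapses to the host; by page, the deep-page path is kept. A hedged Python sketch, with an illustrative helper name:

from urllib.parse import urlparse

def normalise(url, by='page'):
    """Reduce a URL to the unit of analysis: the full deep page, or just the host (site)."""
    parsed = urlparse(url)
    if by == 'site':
        return parsed.netloc                  # e.g. 'www.greenpeace.org'
    return parsed.netloc + parsed.path        # e.g. 'www.greenpeace.org/campaigns/climate'

url = 'http://www.greenpeace.org/campaigns/climate'
print(normalise(url, by='site'))   # analysis by site: homepage-level node
print(normalise(url, by='page'))   # analysis by page: deep-page node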
Set iterations. One may
set the number of iterations of method (crawling
and co-link analysis) to one, two or three
iterations. One iteration is suggested for
social network mapping, two for issue network
mapping and three for establishment network
mapping. For a longer description of the distinction
between networks, see also scenarios of use,
http://www.govcom.org/scenarios_use.htm.
Crawl depth. One may crawl
sites one, two or three layers deep.
Here is a strict definition of how
depth is calculated.
The pages fetched from the starting point
URLs are considered to be depth 0. The pages fetched from URL links
from those pages are considered to be depth
1. In general, the pages found from URL links
on a page of depth N are considered to be
depth N+1. If you set a depth of 2, then no
pages of depth 2 will be fetched. Only pages
of depth 0 and 1 will be fetched (i.e., two
levels of depth). {Text by David Heath at
Oneworld.}
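As an illustration of that depth rule, the sketch below runs a breadth-first crawl in Python: pages fetched from the starting points are depth 0, their outlinks depth 1, and a depth setting of N stops fetching before depth N. The fetch_outlinks argument is a placeholder for an actual page fetch, not an Issue Crawler function.

def crawl(start_urls, depth_setting, fetch_outlinks):
    """Breadth-first crawl that fetches pages at depths 0 .. depth_setting-1 only."""
    fetched = set()
    links = {}                                  # page -> outlinks captured from it
    frontier = list(dict.fromkeys(start_urls))  # depth 0
    for _ in range(depth_setting):              # one pass per fetched depth level
        next_frontier = []
        for url in frontier:
            if url in fetched:
                continue
            fetched.add(url)
            links[url] = fetch_outlinks(url)    # placeholder for the real page fetch
            next_frontier.extend(links[url])
        frontier = next_frontier                # these pages sit one level deeper
    return links  # outlinks of the deepest fetched level are captured but not followed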
Tips:
1. Use links pages as starting
points. Links pages are the URLs where hyperlinks
are listed, e.g., http://www.freetibet.org/info/links.html.
Occasionally sites, using frames or other
structures, are so designed that visitors
may have the impression that they are always
on the homepage. If, on the homepage, you
notice a hyperlink to ‘links’
or ‘resources’, right-mouse click
the ‘links’, copy location to
clipboard, and paste into the harvester. Use
as many links pages as possible for your starting
points.
2. Give the crawler the least amount
of work to do. Using a few links
pages as starting points, with one iteration
of method and one layer deep will provide
the quickest crawl completion.
3. Before launching a crawl, name
the crawl clearly. Name the crawl
so that others viewing the archive will understand
what it is. Viewing the archive will provide
you with an understanding of crawls that have
been named well or less so.
Ceilings (advanced). The crawled
URL ceiling (per host) is the maximum quantity
of URLs crawled on each host. The crawled URL
ceiling (overall) is the total quantity of URLs
crawled (max 60000). The co-link ceiling by
page (pages per host per iteration) is the maximum
quantity of co-linked pages returned per iteration
(max 1000). The co-link ceiling by site (hosts
per iteration) is the maximum quantity of co-linked
sites returned per iteration (max 1000).
Crawl speed.
You may modify the crawl speed. For example,
if you are not in a hurry, set the crawl speed
to low.
Exclusion list.
There is a list of URLs to be excluded from
crawling and thereby excluded from the results,
e.g., software download pages, site stats
counters, search engines and others. It is
suggested that you keep your own list. You
may edit the existing list. Please note the
list format, and edit the list using the same
format, i.e., www.google.com ; news.google.com.
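A sketch of how a list in that format might be parsed and applied, assuming entries are hostnames separated by semicolons as in the example above; the helper names are illustrative, not the Issue Crawler's own.

from urllib.parse import urlparse

def parse_exclusions(raw):
    """Parse an exclusion list written in the documented 'host ; host' format."""
    return {entry.strip().lower() for entry in raw.split(';') if entry.strip()}

def is_excluded(url, exclusions):
    """True if the URL's host is an excluded host or a subdomain of one."""
    host = urlparse(url).netloc.lower()
    return any(host == ex or host.endswith('.' + ex) for ex in exclusions)

exclusions = parse_exclusions('www.google.com ; news.google.com')
print(is_excluded('http://news.google.com/world', exclusions))  # True
print(is_excluded('http://www.greenpeace.org/', exclusions))    # False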
Name and
Launch crawl.
Name crawl before launch. Use a name that
clearly identifies the network you seek. Once
you have launched a crawl, your crawl details
will appear. These include the name of your
crawl, and the time and date launched.
The principal purpose of the Network Manager
as well as the Archive is to allow you to
generate, view, edit,
save and print maps. You also may
view and download issuecrawler data outputs,
such as a ranked actor list for each network,
and an actor interlinking matrix.
The Network Manager provides a list of your
completed crawls. The Archive provides a list
of all users’ completed crawls.
The archive may be searched.
Features of the Network Manager and Archive
The Network Manager and the Archive have a
number of features.
List of completed crawls. Listed
are the network names and top five organizations
in each network. Each network lists the top
5 URLs beneath the title of the network, with
an inlink count in parentheses. The inlink
count is the total number of links the organization
or site has received from the network. It
is a page count. Clicking on an organization
(in the form of a shortened URL) places it
in the archive search, and allows you to find
all maps in the archive containing that organization
(according to the homepage URL, without
the www, such as greenpeace.org). It seems
that worldbank.org currently appears in the
most networks in the archive.
Network Selection - The Scheduler.
You may schedule the network to repeat the
crawl at specified intervals using either
your original starting points or the network
results. This allows you to watch
the evolution of the network over time,
either on your terms (scheduling a crawl using
your starting points) or on the network’s
terms (scheduling a crawl using last available
network results).
Network Selection – View Map.
You may view a depiction of your network as
a circle or cluster map. You also may send
your network results to the issuegeographer
and make a geographical map. It plots site registration locations on a geographical map, using whois.net data.
Network Selection – Edit Map
Name and Add Legend Text. You may
change the name of the map and add a legend
text by pressing the + sign below, editing
and pressing save changes. The legend text
will appear on the map.
Network Selection – Other Data
Outputs. Available are: the xml source
file; the raw data (comma separated); ranked
network actors lists; an actor list with interlinkings
(core network) and its equivalent non-matrix
version; actor list with interlinkings (core
network and periphery) and its equivalent
non-matrix version; and the page list with
their interlinkings (core and periphery).
Each is detailed in 5.5.
Map Viewing
Pressing View Depiction for a cluster map
or a circle map generates a map. The map is
generated as a scalable vector graphic (svg).
The Internet Explorer browser requires a plug-in
to view an svg file. An svg viewer plug-in
is available at http://www.adobe.com/svg.
Other W3C standards-compliant browsers with native svg support built in do not require the plug-in.
The map shows its name, author, crawl
start and completion dates, as well
as the crawler settings. It also loads statistics
of the largest node on the map, by default.
The largest node is the node that has received
the most inlinks from the network actors.
Legend text may be added
on the network details page.
The legend shows the top-
and second-level domains ("node types")
represented on the map.
For the cluster
map, the placement of the nodes on
the map is significant. Placement
is relative to significance of the node to
other nodes, according to the ReseauLu
approach.
Map Interactivity
Clickable Node Names. Each
node name on the map is clickable. Clicking
a node name will open a pop-up window and
retrieve the URL associated with the node
name. Should you have run your crawl with
the co-link analysis mode set to ‘by
page’, often the nodes are ‘deep
pages’.
Clickable Nodes
Selecting a node shows the destination URL,
the node’s crawl inlink count, as well
as its links to and from other network actors,
in the statistics.
Clickable Node Types (domains
and sub-domains)
You may turn on and off links to and from
domains and sub-domains listed in the legend.
You also may turn on and off links, using
the drop-down menu.
Zooming and Panning. To zoom
in, out and return to the original view, use ctrl-mouse.
To pan, press alt and drag.
Saving Map.
Use the save and export option on the map.
Save the interactive .svg file
for uploading to a site or for file transfer.
In order for the .svg file to load on your
site, put a line in the mime-types configuration
for your webserver that recognizes svg and
outputs the correct content type to the web
browser. This is standard with Apache.
Save the .jpg or .png file as a flat image for pasting into a document or into html. Save the .tiff flat image for higher print quality. Save the .pdf file as a document.
Printing
Map.
Print from the imported or saved file. Landscape orientation is advised. Printing from the browser also works but is not optimal. To print the .pdf, set the document to "actual size."
xml source file.
This is the file used by the software to generate
a map. It contains an [info] section at the
top of the page, which shows crawl errors,
should they have occurred. At the bottom of
the file is the original starting point list.
raw data (comma separated). This
is useful for importing data into other analytical
software.
ranked actor list by inlink count
from network (by page). A
ranked actor list (by page count) is outputted
from network results. It is useful for comparative
analysis over time, in combination with the
scheduler.
ranked actor list by inlink count
from network (by site). A
ranked actor list (by site count) is outputted
from network results. By site is suggested
for comparative network actor analysis over
time, for the URLs are truncated, e.g., greenpeace.org.
ranked actor list by inlink count
from crawled population.
actor list with interlinkings
(core network). The page shows how the core
actors interlink. There is both a matrix and
a non-matrix version.
actor list with interlinkings
(core network and periphery). The page shows
how core and peripheral actors interlink.
There is both a matrix and a non-matrix version.
page list with their interlinkings
(core and periphery). The page shows all interlinkings
between all pages.
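To illustrate how the interlinking outputs relate to the ranked actor lists: if the interlinking data is read into a matrix-like structure giving, for each source actor, the number of links it gives to each target actor, then summing each target's column yields an inlink count per actor, which can be ranked. The dictionary layout and the sample figures below are assumptions for illustration, not the exact file format.

# Hypothetical interlinking matrix: source actor -> {target actor: link count}.
matrix = {
    'greenpeace.org': {'worldbank.org': 3, 'foe.org': 1},
    'foe.org':        {'worldbank.org': 2, 'greenpeace.org': 1},
    'worldbank.org':  {},
}

def ranked_actor_list(matrix):
    """Sum inlinks per actor (column totals) and rank actors by that count."""
    inlinks = {actor: 0 for actor in matrix}
    for targets in matrix.values():
        for target, count in targets.items():
            inlinks[target] = inlinks.get(target, 0) + count
    return sorted(inlinks.items(), key=lambda item: item[1], reverse=True)

for actor, count in ranked_actor_list(matrix):
    print(f'{actor} ({count})')   # e.g. worldbank.org (5)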
Map Generation
Retaining the default setting will
generate a map with a node count of approximately
25 or fewer nodes. You may raise or lower the node count. Reducing the node count is equivalent to raising an authority threshold: as the count is lowered, only nodes with higher inlink counts are shown.
Cluster
Map
Map Generation
The cluster map's advanced options provide data about your network.
Choose nodes to be mapped allows you to choose
the number of nodes to be mapped
according to a significance measure, that
is, the ‘top’ nodes
according to inlink count per node.
Selection of ties by specificity
is the qualitative strength
of ties. The network clusters actors with the strongest ties to one another.
Selection of ties by frequency
is the quantitative force of ties. The network
clusters actors with the greatest quantity
of ties between them.
Color scheme by type indicates
domain type, e.g., .gov, .co.uk, .gv.at. Color
scheme by structural position indicates
the type of linking behavior, e.g., only gives links, only receives links, gives and receives links.
Size of nodes by inlinks
indicates that the size of the node is relative
to the number of links received by the site
or organization during the crawl.
Size of nodes by centrality
indicates that the size of the node is relative to the number of links given and received per cluster.