Text Extract

Summary: Context-sensitive search, grep, and extract text from multiple pages using search terms and regular expressions
Version: 2008-03-07
Prerequisites: PmWiki 2.2 beta
Status: new & experimental
Maintainer: HansB

Questions answered by this recipe

How can I do context-sensitive searches through multiple pages showing results within lines or paragraphs?
How can I show content from different pages if the content matches specific query terms?
How can I have a context-sensitive Search form?
How can I search pages using regex like grep?

Description

A markup expression for extracting text lines (paragraphs) from multiple pages using regular expressions and wildcard pagename patterns, plus a markup directive that produces a search form.

Installation:

Download Attach:extract.php, copy it to the cookbook folder, and install it in config.php with:

include_once("$FarmD/cookbook/extract.php");

Usage:

Markup syntax:

{(extract 'Text(Pattern)' PageName [PageName2] [PageName3] ... [keyword=value] [keyword=value] ...)}
Arguments:

Text(Pattern)

You can enter text strings or regex patterns. 'cat' will find all occurrences of 'cat'. The search is case-insensitive by default, so 'Cat', 'CAT', 'cAt' etc. will also be returned. 'cat dog' will look for the string 'cat dog'. To match either 'cat' or 'dog', use 'cat|dog'. To match the word 'cat' but not 'catastrophe', use the regex word-boundary marker '\b': '\bcat\b'.

Regex treats some characters as special control characters: the dot ., the star *, the question mark ?, the pipe |, the dollar $, and brackets. To use any of these as a normal character, escape it with a backslash in front, for example '\$' for a literal dollar sign.

The regex dot . represents any character, so if you use a single dot as the text pattern, the whole page content will be returned, since the dot matches everything.
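The matching rules above can be tried outside the wiki. The following is a minimal sketch in Python, not the recipe's own PHP code; it assumes that Python's re module treats the basic constructs shown here (alternation, \b, backslash escapes, the dot) the same way the recipe's regex engine does. The function name matching_lines and the sample text are made up for illustration.

```python
import re

def matching_lines(pattern, text, case_sensitive=False):
    """Return the lines of `text` in which `pattern` is found."""
    # Case-insensitive by default, mirroring the default described above.
    flags = 0 if case_sensitive else re.IGNORECASE
    rx = re.compile(pattern, flags)
    return [line for line in text.splitlines() if rx.search(line)]

# Hypothetical page content, one line per entry.
sample = "\n".join([
    "The Cat sat.",
    "A catastrophe!",
    "My dog barked.",
    "Price: $5 (approx.)",
    "Nothing here",
])

print(matching_lines("cat", sample))       # plain text, any letter case
print(matching_lines(r"\bcat\b", sample))  # whole word 'cat' only
print(matching_lines("cat|dog", sample))   # either 'cat' or 'dog'
print(matching_lines(r"\$5", sample))      # escaped $ is a literal dollar sign
print(matching_lines(".", sample))         # a single dot matches every non-empty line
```

Running it shows that 'cat' also picks up the 'catastrophe' line, while '\bcat\b' does not.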

To exclude lines matching some text (pattern), put it into the cut= option. The snip= option, on the other hand, removes certain words or phrases from the matching lines while still returning the lines themselves.
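As a rough model of how the search pattern, cut= and snip= interact, here is a short Python sketch. It is not the recipe's actual code: the function extract_lines is hypothetical, and it assumes that cut and snip are applied case-insensitively as regex patterns, and that snip is applied only after a line has already matched.

```python
import re

def extract_lines(text, pattern, cut=None, snip=None):
    """Keep lines matching `pattern`, drop lines matching `cut`,
    and delete text matching `snip` from the lines that remain."""
    flags = re.IGNORECASE
    kept = []
    for line in text.splitlines():
        if not re.search(pattern, line, flags):
            continue  # line does not match the search pattern
        if cut and re.search(cut, line, flags):
            continue  # cut: the whole line is excluded
        if snip:
            # snip: the line stays, the snipped text is removed
            line = re.sub(snip, "", line, flags=flags)
        kept.append(line)
    return kept

# Hypothetical page content.
page = "TODO: feed the cat\nThe cat is asleep\nThe dog is awake"

print(extract_lines(page, "cat"))
print(extract_lines(page, "cat", cut="asleep"))
print(extract_lines(page, "cat", snip="TODO: "))
```

With cut the 'asleep' line disappears entirely; with snip the 'TODO: ' prefix is removed but its line is still listed.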

PageName source lists

After the Text(Pattern) first argument, the second and any following bare arguments (those that are not keyword=value options) are treated as page names. A page name can be given as PageName or Group.PageName and can include the wildcard characters star * and question mark ?, with ? representing any single valid character and * representing any string of valid characters. So PmWiki.* means all pages in group PmWiki, and *.RecentChanges means all RecentChanges pages.

A page name with a minus - in front is excluded from the pages to be searched; wiki wildcard characters are allowed here too. Note that a wildcard pagename pattern is not a regex pattern: a dot here is just the separator between the Group and PageName components of a page name!

Comma-separated lists of page names can also be given.
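To make the difference between wiki wildcards and regex concrete, here is a small Python sketch of how such a source list could be resolved. It is an illustration only: the functions wildcard_to_regex and select_pages are hypothetical, "valid characters" is simplified to "any character", only full Group.PageName names are handled, and letter case is not treated specially.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wiki wildcard pattern into a regex.
    '*' becomes '.*', '?' becomes '.', and everything else,
    including the Group.PageName dot, is matched literally."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def select_pages(specs, all_pages):
    """Pick pages matching any include pattern and no '-' exclude pattern."""
    includes, excludes = [], []
    for spec in specs:
        for name in spec.split(","):  # comma-separated lists are allowed
            name = name.strip()
            if not name:
                continue
            if name.startswith("-"):
                excludes.append(wildcard_to_regex(name[1:]))
            else:
                includes.append(wildcard_to_regex(name))
    return [p for p in all_pages
            if any(rx.fullmatch(p) for rx in includes)
            and not any(rx.fullmatch(p) for rx in excludes)]

# Hypothetical list of pages on the wiki.
pages = ["PmWiki.Installation", "PmWiki.RecentChanges",
         "Main.HomePage", "Main.RecentChanges"]

print(select_pages(["PmWiki.*", "-*.RecentChanges"], pages))
print(select_pages(["*.RecentChanges"], pages))
print(select_pages(["Main.HomePage,PmWiki.Installation"], pages))
```

Because the dot is escaped, PmWiki.* cannot match a page in a group named, say, PmWikiX.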

Instead of using a whole page as the source for the text extract, you can specify a page section defined by an anchor with Group.PageName#anchor, or a section between two anchors with Group.PageName#anchor1#anchor2. Wiki wildcards cannot be used with this section syntax.
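The section syntax can be pictured with the following Python sketch. It is only a model under stated assumptions, not the recipe's implementation: anchors are taken to be written as [[#name]] in the page text, a single-anchor section is assumed to end at the next anchor (or at the end of the page), and the anchor markup itself is left out of the result. The function names split_source and page_section are hypothetical.

```python
import re

def split_source(spec):
    """Split 'Group.PageName#a#b' into (page, start_anchor, end_anchor)."""
    page, _, anchors = spec.partition("#")
    start, _, end = anchors.partition("#")
    return page, start or None, end or None

def page_section(text, start, end=None):
    """Return the text after [[#start]] up to [[#end]].
    Without `end`, stop at the next anchor (an assumption) or at the end."""
    begin = re.search(r"\[\[#" + re.escape(start) + r"\]\]", text)
    if not begin:
        return ""  # start anchor not found: nothing to extract
    rest = text[begin.end():]
    if end:
        stop_rx = r"\[\[#" + re.escape(end) + r"\]\]"
    else:
        stop_rx = r"\[\[#[A-Za-z][-.:\w]*\]\]"  # any following anchor
    stop = re.search(stop_rx, rest)
    return rest[:stop.start()] if stop else rest

# Hypothetical page text with three anchors.
text = "intro\n[[#faq]]\nQ: cats?\n[[#faqend]]\nfooter\n[[#other]]\nmore"

page, start, end = split_source("Main.HomePage#faq#faqend")
print(page, start, end)
print(repr(page_section(text, start, end)))
print(repr(page_section(text, "other")))
```

Only the text returned by page_section would then be searched for the text pattern.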

Search form markup

Markup (:extract:) will produce a search form with a field for entering search terms or a regular expression and a field for entering a page name or a pagename pattern with wildcards. Results will be shown at the position of markup (:extractresult:).

Default parameters for markup (:extract:)

Other optional parameters

Examples:

Default Search Form

(:extract:)

(:extractresult:)

Search PmWiki Documentation

(:extract page=PmWiki.*:)

(:extractresult:)

Notes

Fox Context Sensitive Search Form

The following Fox form can be used as a general Search / Find / Text extraction tool (needs fox.php version 2008-01-25 or later).
Make sure to allow foxaction 'display' by setting $FoxPagePermissions['Group.PageName'] = 'display';
or by adding to Site.FoxConfig: Group.PageName: display;

(:foxmessages:)
>>frame width=25em<<
(:fox form  foxaction=display target={*$FullName}:)
(:input default request=1:)
(:foxtemplate "{$$(extract '{$$search}' {$$pages}
   cut='{$$cut}' snip='{$$snip}' case='{$$case}' prefix=link header=full)}":)
|| Search for:||(:input text search size=30:) ||
|| On pages:||(:input text pages size=30:) ||
|| Cut lines incl.:||(:input text cut '' size=30:) ||
|| Snip text:||(:input text snip '' size=30:) ||
|| Case sensitive:(:input checkbox case '1':) ||||
|| ||(:input submit post Enter:) (:input submit cancel Cancel:) ||
(:foxend form:)
>><<

(:foxdisplay:)

The form displays its output at the position of the (:foxdisplay:) markup.

If you want to write the output into a page (instead of writing the expression), and want the output cleaned up so that no (:spacer:) markup is written, use this in the template:

{$$(cleanspacer (extract .....))}

and change the fox form markup to:

(:fox form target=YourTargetPage:)

PmWiki Search Form

It is possible to use TextExtract with a standard PmWiki search form, but searches are about 3 to 4 times slower. This may be useful in situations where you need pagelist options that TextExtract does not supply.

Create a custom pagelist template in Site.LocalTemplates:

!!!#extract
[[#extract]]
{(extract dummy {=$FullName} prefix=link)}
[[#extractend]]

and use a searchbox form for instance like this:

Search the PmWiki Documentation

(:searchbox group=PmWiki fmt=#extract:)

(:searchresults:)

(:searchbox ...:) can take all standard pagelist parameters.

Release Notes

See Also

Contributors

Comments