Namazu 2.0 tutorial


This tutorial is for users who are new to Namazu 2.0.

Mission

This tutorial is written to reduce the workload of using Namazu. Please refer to the manual to learn about all the features of Namazu. An installation guide is given in the INSTALL file.

History of development

The history of Namazu development from 1.3.0.x through 2.0 is as follows.

1.3.0.x
Old stable version. Using 1.3.0.11 is recommended, since versions 1.3.0.10 and earlier may allow junk files to be created from outside.
1.3.0.11 is the latest version in this series.
1.3.1.0
Development version. Introduced a checkpoint function (-s option: mknmz periodically "exec"s itself to keep the process from growing explosively). However, this version was not released to the public, and development moved on to 1.4.0.0.
1.4.0.0
Development version. Improved performance by using Perl modules. However, this version was not released to the public, and development moved on to 1.9.x.
1.9.x
Development version. In-progress versions released during the development of version 2.0.
2.0
Stable version since 2000/02.
current
In-progress versions. The current sources can be obtained via CVS.

Namazu components

Namazu consists of three major components: mknmz, namazu, and namazu.cgi.

Preparation and make

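You need the following software to build Namazu 2.0.

Perl
Description: Perl language
Status: Required
Current version: 5.8.8
Required version: >= 5.004
File name: perl5.005_03.tar.gz
Development and distribution: Larry Wall
Sources (example): GNU, CPAN

make
Description: Maintains groups of programs
Current version: 3.81
File name: make-3.81.tar.gz
Development and distribution: FSF
Sources (example): GNU
Others: Required when Namazu cannot be built with the make bundled with the system.

gettext
Description: Translates messages
Status: Required only for multi-language messages
Current version: 0.14.6
Required version: >= 0.13.1
File name: gettext-0.14.6.tar.gz
Development and distribution: FSF
Sources (example): GNU
Others: Indispensable on Solaris.

nkf
Description: Network Kanji Filter
Status: For Japanese processing only
Current version: 2.0.7
Required version: >= 1.71
File name: nkf207.tar.gz
Development and distribution: Shinji Kono, Rei FURUKAWA
Sources (example): nkf_utf8
Others: Avoid versions 1.90, 1.92, and 2.0.0 - 2.0.3 (see notes).

NKF
Description: nkf Perl module
Status: For Japanese processing only. ++
Current version: 2.0.7
Required version: >= 1.71

KAKASI
Description: Japanese/Romaji conversion
Status: For Japanese processing only. **
Current version: 2.3.4
Required version: >= 2.x
File name: kakasi-2.3.4.tar.gz
Development and distribution: KAKASI Project
Sources (example): namazu.org

Text::Kakasi
Description: KAKASI Perl module
Status: For Japanese processing only. ++
Current version: 2.04
Required version: >= 1.05
File name: Text-Kakasi-2.04.tar.gz
Development and distribution: NOKUBI Takatsugu, Dan Kogai
Sources (example): CPAN dist

ChaSen
Description: Japanese morphology analyzer
Status: For Japanese processing only. **
Current version: 2.3.3
Required version: >= 2.0x
File name: chasen-2.3.3.tar.gz
Development and distribution: Nara Institute of Science and Technology
Sources (example): Distribution Policy
Others: For libchasen.a in ChaSen 2.02 or earlier, refer below.

Text::ChaSen
Description: ChaSen Perl module
Status: For Japanese processing only. ++
Current version: 1.03
Required version: <=
File name: Text-ChaSen-1.03.tar.gz
Development and distribution: NOKUBI Takatsugu
Sources (example): Text::ChaSen

MeCab
Description: Yet another Japanese morphology analyzer
Status: For Japanese processing only. **
Current version: 0.93
Required version: >= 0.6
File name: mecab-0.93.tar.gz
Development and distribution: Taku Kudo
Sources (example): MeCab
Others: Supported from Namazu 2.0.15. (MeCab 0.90 and later are supported from Namazu 2.0.16.)

mecab-perl
Description: MeCab Perl module
Status: For Japanese processing only. ++
Current version: 0.93
Required version: >= 0.76
File name: mecab-perl-0.93.tar.gz
Development and distribution: Taku Kudo
Sources (example): MeCab
Others: Supported from Namazu 2.0.15. (MeCab 0.90 and later are supported from Namazu 2.0.16.)

File::MMagic
Description: File type detection
Status: Included
Current version: 1.27
Required version: >= 1.20
File name: File-MMagic-1.27.tar.gz
Development and distribution: NOKUBI Takatsugu
Sources (example): CPAN dist
Others: Packaged in the Namazu distribution.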

(Notes listed below are for Japanese processing only.)

Japanese Environment

As of 2.0.6, the handling of environment variables has changed, and a new command-line option has been added to mknmz.

Environment variables

To use Namazu 2.0 in a Japanese environment, you may need to set environment variables for language selection.

In 2.0.5 and earlier, the same environment variables selected the language for both message translations and internal text processing.

Environment variable names for language selection (priority from left to right)
Message translations   LANGUAGE  LC_ALL  LC_MESSAGES  LANG
Text processing        LANGUAGE  LC_ALL  LC_MESSAGES  LANG

In 2.0.6, this was changed as follows.

Environment variable names for language selection (priority from left to right)
Message translations   LANGUAGE  LC_ALL  LC_MESSAGES  LANG
Text processing        LC_ALL  LC_CTYPE  LANG
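
The 2.0.6 priority order for text processing can be checked from the shell before running mknmz. The following is a minimal sketch, not part of Namazu: the function name text_processing_lang is hypothetical, and it assumes that an empty variable is treated the same as an unset one.

```shell
# Hypothetical helper: report which variable would decide the text-processing
# language under the 2.0.6 priority order LC_ALL > LC_CTYPE > LANG.
# Assumption: an empty value counts as "not set".
text_processing_lang() {
    for name in LC_ALL LC_CTYPE LANG; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            printf '%s=%s\n' "$name" "$value"
            return 0
        fi
    done
    printf 'none of LC_ALL, LC_CTYPE, LANG is set\n'
    return 1
}
```

Calling text_processing_lang in the shell from which mknmz will be started prints the first variable in the priority order that has a value.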

A typical setup for processing Japanese is to set one of the following values, depending on your system environment.

Sample language settings
Unix OS    ja
Windows    ja_JP.SJIS

The actual command for setting the value shown above depends on your shell:

C shell              setenv LANG ja
Bourne shell, etc.   LANG=ja; export LANG

In the example above, the value ja is set for LANG, and all processing will be done for Japanese. Some systems may require ja_JP, ja_JP.eucJP, ja_JP.EUC, or ja_JP.ujis instead of just ja.

If the variables are not set properly when mknmz is executed, the resulting index files will be malformed. For example, NMZ.w is supposed to contain one (Japanese) word per line; if you browse it and find long, unsegmented sentences on each line instead, the index was built incorrectly. In that case, namazu and namazu.cgi will not return correct results.
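
One rough way to spot this symptom is to count unusually long lines in NMZ.w. The sketch below is only an illustration: the function name count_long_lines and the default threshold of 40 are assumptions, and whether awk measures length in bytes or characters depends on the awk implementation and locale.

```shell
# Hypothetical check: count lines in a word list that exceed a length threshold.
# A well-formed NMZ.w should have one word per line, so long lines are suspicious.
count_long_lines() {
    # $1 = file to inspect, $2 = maximum expected line length (assumed default: 40)
    awk -v max="${2:-40}" 'length($0) > max { n++ } END { print n + 0 }' "$1"
}
```

For example, count_long_lines /tmp/index/NMZ.w prints 0 when no line is longer than the threshold; a large count suggests that segmentation did not take place.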

--indexing-lang command line option (mknmz)

Since 2.0.6, mknmz accepts the --indexing-lang=LANG option.

You can specify the language processing type with this option, for example --indexing-lang=ja. A value given on the command line overrides the environment variables.

Test before "make install"

If you wish to test mknmz before make install, run
cd namazu-2.0.x (the directory where you unpacked the *.tar.gz)
env pkgdatadir=`pwd` scripts/mknmz (for csh/tcsh)
or
pkgdatadir=. scripts/mknmz (for sh/bash).
These commands make mknmz use the adjacent pl, filter, template, etc. directories instead of the existing files under /usr/local/share/namazu.

(To learn more about this, see the $PKGDATADIR variable in mknmz.)

For a first run, you may try the following examples, which display the configuration, display the help, and generate an index for the contents of ~/Mail, respectively.

    ./mknmz -C
    ./mknmz --help
    ./mknmz -O /tmp ~/Mail

Help Menu

If you type mknmz or namazu with no argument, a short usage message is displayed. If you give --help as an argument, a long usage message is displayed. The -C option displays the configuration in effect at that time. It is useful to remember these three usages.

How to get help on the command line
Argument   Meaning         Other arguments
None       Short usage     Cannot be combined with any argument
--help     Long usage      Other arguments are ignored
-C         Configuration   Other arguments take effect

Running mknmz

First, create an index. (If you wish to run mknmz before make install, see Test before "make install".)
The format has changed slightly from version 1.4.0.8. URI replacement is handled by specifying the --replace option. Alternatively, URI replacement can be done when namazu/namazu.cgi runs: in that case, run mknmz without the --replace option and set up .namazurc so that the replacement is performed during namazu/namazu.cgi execution.

Run mknmz as follows.

mknmz [options] target directory

This form creates the index in the current directory. Use the -O option to specify a different output directory.

For example,

      mkdir /tmp/index
      mknmz -O /tmp/index \
      --replace='s#/foo/bar/doc/#http://foo.bar.jp/software/#' \
      /foo/bar/doc
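
To preview what the substitution in the example above does to a path, the same expression can be fed to sed. This is only a sketch: mknmz itself evaluates the expression as Perl code, and the assumption here is that sed's s### behaves identically for this simple literal pattern. The function name preview_replace and the sample file names are hypothetical.

```shell
# Preview the path-to-URI rewrite used in the --replace example.
# Assumption: for this literal pattern, sed and Perl substitutions agree.
preview_replace() {
    sed 's#/foo/bar/doc/#http://foo.bar.jp/software/#'
}

# Feed two sample paths (hypothetical file names) through the substitution.
printf '%s\n' /foo/bar/doc/index.html /foo/bar/doc/sub/page.html | preview_replace
```

Each local path under /foo/bar/doc/ comes out with the http://foo.bar.jp/software/ prefix in its place.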

mknmz prints messages like the following while creating the index. To display the messages in Japanese, see Japanese Environment.


    14 files are found to be indexed.
    1/14 - /foo/bar/acrobat3.pdf [application/pdf]
    2/14 - /foo/bar/excel97.xls [application/excel]
    3/14 - /foo/bar/html.html [text/html]
    4/14 - /foo/bar/mail-multipart.txt [message/rfc822]
    5/14 - /foo/bar/mail.txt [message/rfc822]
    6/14 - /foo/bar/man.1 [text/x-roff]
    7/14 - /foo/bar/msg00000.html [text/html; x-type=mhonarc]
    8/14 - /foo/bar/plain.txt [text/plain]
    9/14 - /foo/bar/plain.txt.Z [text/plain]
    10/14 - /foo/bar/plain.txt.bz2 [text/plain]
    11/14 - /foo/bar/plain.txt.gz [text/plain]
    12/14 - /foo/bar/rfc0000.txt [text/plain; x-type=rfc]
    13/14 - /foo/bar/tex.tex [application/x-tex]
    14/14 - /foo/bar/word97.doc [application/msword]
    Writing index files...
    [Base]
    Date:                Thu Mar 16 22:14:01 2000
    Added Documents:     14
    Size (bytes):        58,701
    Total Documents:     14
    Added Keywords:      95
    Total Keywords:      95
    Wakati:              module_kakasi -ieuc -oeuc -w
    Time (sec):          14
    File/Sec:            1.00
    System:              linux
    Perl:                5.00503
    Namazu:              2.0.X

Customizing mknmz

Namazu was originally developed for processing HTML documents, but it can now deal with various document formats. You will find useful scripts in /usr/local/share/namazu/filter, and a detailed explanation is given under Document filters in the Namazu manual.

Mail in MH format
Run mknmz on the mail folder:
% mknmz ~/Mail/foobar
MHonArc
Namazu applies MHonArc-specific processing to MHonArc HTML.
hnf
A .mknmzrc for hnf and a guide can be obtained from Hyper NIKKI System.
Documents stored on other machines
Namazu alone cannot search such documents. Use it in combination with other tools (e.g. wget, NFS) that transfer the documents.

For the mknmz command-line arguments, mknmz --help prints usage information. With the -C option, mknmz prints the configuration in effect at that time.


    Loaded rcfile: /home/foobar/.mknmzrc
    System: linux
    Namazu: 2.0.X
    Perl: 5.00503
    File-MMagic: 1.27
    NKF: module_nkf
    KAKASI: module_kakasi -ieuc -oeuc -w
    ChaSen: module_chasen -i e -j -F "%m "
    MeCab: module_mecab -Owakati -b 8192
    Wakati: module_kakasi -ieuc -oeuc -w
    Lang_Msg: C
    Lang: C
    Coding System: euc
    CONFDIR: /usr/local/etc/namazu
    LIBDIR: /usr/local/share/namazu/pl
    FILTERDIR: /usr/local/share/namazu/filter
    TEMPLATEDIR: /usr/local/share/namazu/template
    Supported media types:   (42)
    Unsupported media types: (2) marked with minus (-) probably missing application in your $path.
      application/excel: excel.pl
      application/gnumeric: gnumeric.pl
      application/ichitaro5: taro56.pl
      application/ichitaro6: taro56.pl
      application/ichitaro7: taro7_10.pl
      application/macbinary: macbinary.pl
      application/msword: msword.pl
      application/pdf: pdf.pl
      application/postscript: postscript.pl
      application/powerpoint: powerpoint.pl
      application/rtf: rtf.pl
      application/vnd.kde.kivio: koffice.pl
      application/vnd.kde.kpresenter: koffice.pl
      application/vnd.kde.kspread: koffice.pl
      application/vnd.kde.kword: koffice.pl
      application/vnd.oasis.opendocument.graphics: ooo.pl
      application/vnd.oasis.opendocument.presentation: ooo.pl
      application/vnd.oasis.opendocument.spreadsheet: ooo.pl
      application/vnd.oasis.opendocument.text: ooo.pl
      application/vnd.sun.xml.calc: ooo.pl
      application/vnd.sun.xml.draw: ooo.pl
      application/vnd.sun.xml.impress: ooo.pl
      application/vnd.sun.xml.writer: ooo.pl
      application/x-apache-cache: apachecache.pl
      application/x-bzip2: bzip2.pl
      application/x-compress: compress.pl
    - application/x-deb: deb.pl
    - application/x-dvi: dvi.pl
      application/x-gzip: gzip.pl
      application/x-js-taro: taro7_10.pl
      application/x-rpm: rpm.pl
      application/x-tex: tex.pl
      application/x-zip: zip.pl
      audio/mpeg: mp3.pl
      message/news: mailnews.pl
      message/rfc822: mailnews.pl
      text/hnf: hnf.pl
      text/html: html.pl
      text/html; x-type=mhonarc: mhonarc.pl
      text/html; x-type=pipermail: pipermail.pl
      text/plain
      text/plain; x-type=rfc: rfc.pl
      text/x-hdml: hdml.pl
      text/x-roff: man.pl

Targets of index creation

Short name   Long name            Description
-F           --target-list=FILE   Read the list of target files for index creation from FILE
-t           --media-type=MTYPE   Specify the document format of the target files
             --allow=PATTERN      Specify a regular expression for target file names
             --deny=PATTERN       Specify a regular expression for file names to exclude
             --exclude=PATTERN    Specify a regular expression for path names to exclude
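
The -F option reads its targets from a list file. A minimal sketch of building such a list with find follows; the helper name make_target_list, the one-path-per-line layout of the list, and the example paths are assumptions made for illustration, so check the manual for the exact format mknmz expects.

```shell
# Hypothetical helper: write the paths of matching files under a directory,
# one per line and sorted, to a list file.
make_target_list() {
    # $1 = directory to scan, $2 = file name glob, $3 = output list file
    find "$1" -type f -name "$2" | sort > "$3"
}

# Assumed usage (not run here):
#   make_target_list /foo/bar/doc '*.html' /tmp/targets.txt
#   mknmz -O /tmp/index -F /tmp/targets.txt
```

Building the list separately makes it easy to review exactly which files will be indexed before mknmz is started.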

Running namazu

To search documents, run

      % namazu query index

If you omit index, namazu uses /usr/local/var/namazu/index as the target.

The namazu command is configured in namazurc. An example namazurc can be found at /usr/local/etc/namazu/namazurc-sample in the Namazu distribution package.

To use the CGI on the web, you need to configure the web server. For Apache, the relevant settings are as follows.

ScriptAlias /cgi-bin/ /usr/local/apache/cgi-bin/   Directory alias for /cgi-bin/ in URIs
AddHandler cgi-script .cgi                         Execute files ending with ".cgi" as CGI
AllowOverride All                                  Allow .htaccess configuration (Web administrator)
Options ExecCGI                                    Allow CGI execution
DirectoryIndex index.html                          File to display when a URI specifies a directory

Settings other than the one marked (Web administrator) can be made in .htaccess. (Note that the Apache configuration may forbid these settings.)

What you can do with Namazu

What is written here is not a guarantee. It only introduces advanced usage that the developers have in mind.

What you cannot do with Namazu

Others

Targets of index creation
Which files in the specified "target directory" are indexed depends on the $ALLOW_FILE and/or $DENY_FILE directives in mknmzrc, or on the -a, --allow, --deny, and --exclude command-line options.
mew-1.94b2x and mew-nmz.el
mew works in combination with namazu; the relevant features are implemented in contrib/mew-nmz.el, and further information can be found in contrib/00readme-namazu.jis.

Terminology

KAKASI
Software to convert Kanji to Hiragana/Katakana/Ro-maji. Namazu uses this as a segmentation tool.
ChaSen
Japanese morphological analyzer. Namazu uses this as a segmentation tool.
MeCab
MeCab is yet another part-of-speech and morphological analyzer. It was originally developed based on ChaSen, but Mr. Kudo now develops it from scratch, independently of ChaSen. Its analysis accuracy is about the same as ChaSen's, but it runs faster than ChaSen.
Segmentation
Unlike English, Japanese does not put spaces between words. Plain Japanese text is therefore first preprocessed so that words are separated by spaces. This is called segmentation. (The term "segmentation" is used in the same sense outside computing as well.)
Index (noun)
               (Preparation)                (Search display)
                          mknmz       namazu
                         ^     |     ^      |
                         |     v     |      v
      Original Document        Index         Search Result
Namazu prepares an index of words prior to any search request and, upon request, searches the documents based on that prepared index. This "prepared index" is called the index. In Namazu, the NMZ.* files are the index.
Index (verb)
To create the index explained above. Use mknmz.
Several Index
The ability to create more than one index and search the documents in all of them.
Phrase searching
A basic Namazu search is a combination of words. "foo and bar" and "bar and foo" (reverse order) are treated the same way, and foo and bar may appear anywhere in the document. In contrast, searching for the string "foo bar" in this exact order is called phrase searching.
namazu.conf, conf.pl
In version 1.4 and earlier, namazu and mknmz were configured in namazu.conf and conf.pl, respectively. In version 2.0, these were replaced by namazurc and mknmzrc, respectively.
mknmzrc (/usr/local/etc/namazu/mknmzrc)
Basic configuration for mknmz.
namazurc (/usr/local/etc/namazu/namazurc)
Configure this if you wish to change the behavior of namazu and/or namazu.cgi. You can configure Index, Replace, Logging, Lang, and Template. For further details, see the manual.
Perl module
In old versions, NKF, KAKASI, and ChaSen were called from Namazu as external processes. A process was invoked for each file, so execution was slow. In the current version, these are available as Perl modules, which makes execution faster because no external process is invoked.
This feature is not offered in Namazu 1.3 or earlier; it is available in Namazu 1.4 or later. To test whether the Perl modules used by Namazu are installed, run
perl -MText::Kakasi -e ''
perl -MText::ChaSen -e ''
perl -MMeCab -e ''
perl -MNKF -e ''
If nothing is displayed, the module is installed and you can take advantage of it. If you then run ./configure in the namazu source directory, these Perl modules will be used.
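
The four checks above can be wrapped in a small loop that reports each module by name. This is a sketch only; the function name have_perl_module is hypothetical, and it relies on nothing more than the exit status of the perl -M commands shown above.

```shell
# Hypothetical helper: succeed if the named Perl module can be loaded.
have_perl_module() {
    perl -M"$1" -e '' >/dev/null 2>&1
}

# Report the status of each module that Namazu can use for Japanese processing.
for module in Text::Kakasi Text::ChaSen MeCab NKF; do
    if have_perl_module "$module"; then
        echo "$module: available"
    else
        echo "$module: not found"
    fi
done
```

A module reported as "not found" is either not installed or not on Perl's module search path.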

References

KAKASI - Kanji Kana Simple Inverter
Program and Dictionary to convert Kanji-Kana sentences to Hiragana/Ro-maji sentences.
Creator: Hironobu Takahashi, Maintenance: KAKASI Project
In Namazu, KAKASI is used for Japanese segmentation.
http://kakasi.namazu.org/
Development and Distribution
http://www.namazu.org/
FAQ (Japanese)
http://www.namazu.org/FAQ.html
Namazu Mailing List
http://www.namazu.org/ml.html
Namazu Development version
http://www.namazu.org/development.html

Namazu Homepage

$Id: tutorial.html.en,v 1.33 2006/10/21 06:26:08 opengl2772 Exp $
developers@namazu.org