GitHub - chahar-deepak/Information-Retrirval: Assignments done for completion of academic requirements of IR course

The file object is first parsed with BeautifulSoup's HTML parser to remove tags. The resulting text is then passed through nltk's word_tokenize function, which returns a list of tokens. The tokens are then filtered to remove those that are neither alphabetic nor numeric (the result is stored in 'string-tokens').
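The preprocessing steps above can be sketched roughly as follows. This is a minimal standard-library stand-in, not the code in code.py: `strip_tags` takes the place of BeautifulSoup's HTML parser, `simple_tokenize` is a regex substitute for nltk's word_tokenize, and `filter_tokens` assumes the filter keeps tokens that are purely alphabetic or purely numeric. All three names are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text between tags (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def simple_tokenize(text):
    """Split text into word and punctuation tokens (regex stand-in for word_tokenize)."""
    return re.findall(r"\w+|[^\w\s]", text)


def filter_tokens(tokens):
    """Keep tokens that are purely alphabetic or purely numeric (assumed filter rule)."""
    return [t for t in tokens if t.isalpha() or t.isdigit()]


sample = "<p>Hello world, 42 times!</p>"
string_tokens = filter_tokens(simple_tokenize(strip_tags(sample)))
print(string_tokens)
```

Punctuation tokens such as "," and "!" are dropped by the filter, while words and numbers pass through unchanged.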

Stemming is done with PorterStemmer(). Lemmatization is done with WordNetLemmatizer.

'plotting' function uses the matplotlib library to plot the 'plotting_element_count' most frequent tokens.

'getloglog' function takes a dict of string:frequency and draws a log-log plot for Zipf's law analysis.
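As a rough illustration of what the log-log plot checks, the sketch below builds the (log rank, log frequency) points from a string:frequency dict and fits a least-squares slope. Under Zipf's law the points fall near a straight line with slope close to -1. The names `loglog_points` and `fit_slope` and the sample counts are assumptions made for this example; the actual 'getloglog' function plots the points and may not fit a slope at all.

```python
import math


def loglog_points(freqs):
    """Return (log10 rank, log10 frequency) pairs, most frequent token first."""
    ordered = sorted(freqs.values(), reverse=True)
    return [(math.log10(rank), math.log10(f))
            for rank, f in enumerate(ordered, start=1)]


def fit_slope(points):
    """Least-squares slope of y against x for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Made-up counts following frequency = 1200 / rank exactly.
freqs = {"the": 1200, "of": 600, "and": 400, "to": 300}
slope = fit_slope(loglog_points(freqs))
print(round(slope, 3))
```

Real corpora rarely give a slope of exactly -1; the fit is only a quick numeric summary of how closely the plot follows a straight line.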

'func' is a multipurpose function that calculates the relevant parameters and then calls the functions mentioned above. It also calculates the number of unique tokens needed to cover a required X% of the total tokens.
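The coverage calculation can be sketched as follows, assuming tokens are taken in order of decreasing frequency until their cumulative count reaches X% of all token occurrences. The function name `unique_tokens_for_coverage` and the sample counts are hypothetical and are not taken from code.py.

```python
def unique_tokens_for_coverage(freqs, percent):
    """Smallest number of most-frequent unique tokens covering `percent`% of all tokens."""
    total = sum(freqs.values())
    covered = 0
    for count_so_far, freq in enumerate(sorted(freqs.values(), reverse=True), start=1):
        covered += freq
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= percent * total:
            return count_so_far
    return len(freqs)


# Made-up counts that sum to 100 so the percentages are easy to follow.
freqs = {"the": 50, "of": 30, "cat": 15, "dog": 5}
print(unique_tokens_for_coverage(freqs, 80))  # "the" + "of" cover 80 of 100 tokens
print(unique_tokens_for_coverage(freqs, 90))  # a third token is needed to pass 90
```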

'get_dict_of_bigrams', as its name implies, returns a dictionary of bigrams in the form 'tuple of tokens':frequency.
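A dictionary of this shape can be built in a few lines. The sketch below is one possible way to do it and is not necessarily how 'get_dict_of_bigrams' is written; the name `bigram_counts` is made up for the example.

```python
from collections import Counter


def bigram_counts(tokens):
    """Map each pair of adjacent tokens to the number of times it occurs."""
    return dict(Counter(zip(tokens, tokens[1:])))


tokens = ["new", "york", "is", "new", "york"]
bigrams = bigram_counts(tokens)
print(bigrams[("new", "york")])
```

Pairing the list with itself shifted by one position yields every adjacent pair, so a list of n tokens produces n - 1 bigram occurrences.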

The chi-square statistic is calculated and the top 20 results are printed.
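The README does not say what the chi-square statistic is computed over. The sketch below assumes it scores bigrams as collocations using the standard 2x2 contingency-table form, chi2 = N (O11 O22 - O12 O21)^2 / ((O11 + O12)(O11 + O21)(O12 + O22)(O21 + O22)), where O11 counts the bigram itself and N is the total number of bigrams. The names `chi_square_scores` and `top_k` are hypothetical.

```python
from collections import Counter


def chi_square_scores(tokens):
    """Chi-square score for every bigram, from a 2x2 contingency table."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first = Counter()   # how often each word appears as the first element
    second = Counter()  # how often each word appears as the second element
    for (w1, w2), count in bigrams.items():
        first[w1] += count
        second[w2] += count

    scores = {}
    for (w1, w2), o11 in bigrams.items():
        o12 = first[w1] - o11        # w1 followed by something other than w2
        o21 = second[w2] - o11       # w2 preceded by something other than w1
        o22 = n - o11 - o12 - o21    # bigrams containing neither in position
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        scores[(w1, w2)] = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
    return scores


def top_k(scores, k=20):
    """Return the k highest-scoring bigrams as (bigram, score) pairs."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


tokens = ["new", "york", "is", "big", "new", "york", "is", "old"]
scores = chi_square_scores(tokens)
for bigram, score in top_k(scores, 3):
    print(bigram, round(score, 3))
```

Bigrams whose words always occur together score highest, which is why the statistic is often used to surface candidate collocations.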

'anal_of_tokens' function is used to analyse stemming and lemmatization mistakes: it prints the results so that they can be inspected manually.

Folders and files (4 commits):
Assignment_2
README.md
assignment1.pdf
code.py
report.txt
wiki_47
