Skip to content

several bug fixes to twsi_evaluation.py script (see comments near changes)#2

Open
mpelevina wants to merge 1 commit intouhh-lt:masterfrom
mpelevina:master
Open

several bug fixes to twsi_evaluation.py script (see comments near changes)#2
mpelevina wants to merge 1 commit intouhh-lt:masterfrom
mpelevina:master

Conversation

@mpelevina
Copy link
Copy Markdown
Contributor

please, review my suggestions for bug fixes in the evaluation script.

Comment thread twsi_evaluation.py
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why python engine?

@mpelevina
Copy link
Copy Markdown
Contributor Author

The original command triggers the following error:

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  substitutions = read_csv(INVENTORY_FILE, '/\t+/', encoding='utf8', header=None)

@alexanderpanchenko
Copy link
Copy Markdown
Contributor

the proper fix would be rather to

  substitutions = read_csv(INVENTORY_FILE, separator='\t', encoding='utf8', header=None)

but we must be sure the INVENTORY_FILE has no multiple tabs (which would be actually the bug as well)

can you update and check if it works and make another pull request?

@mpelevina
Copy link
Copy Markdown
Contributor Author

This (using '\t') is indeed what I've tried at first, but using it would require rewriting parts of this script.
The thing is, the read_csv command doesn't actually separates rows into expected columns (word, sense id, substitutions/related_words). As you can see on line 103 the actual separation of a row with '\t' takes place.

word, t_id, subs = s[0].split('\t')

The same applies to all three calls of read_csv (substitutions, inventory, predictions). It did seem strange to me, but I've decided not to change it without knowing why it had been implemented like this. I think it might be brought up as an "issue".

@alexanderpanchenko
Copy link
Copy Markdown
Contributor

word, t_id, subs = s[0].split('\t')

this is really strange way to do it indeed.

it would be great if you can fix it changing the corresponding code. take your time to do so. we will need this evaluation code a lot so better to have it right, rather than monkeypatching it.

if you need to modify the format of input files go ahead to fix the problem

@mpelevina
Copy link
Copy Markdown
Contributor Author

Ok, I've put it on my todo list.

@alexanderpanchenko
Copy link
Copy Markdown
Contributor

ok. thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants