Is it better to use x[-1] @ wte.T? #25

Open

Opened on Jan 18, 2024

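Would it be better to change
return x @ wte.T # [n_seq, n_embd] -> [n_seq, n_vocab]
to
return x[-1] @ wte.T # [n_embd] -> [n_vocab]
?

The logits would then cover only the last position, so we could pick the next token directly with
next_id = np.argmax(logits)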
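To make the proposal concrete, here is a minimal sketch comparing the two projections. It is not the repository's code: the sizes are made up, and `x` and `wte` are random stand-ins for the final hidden states and the token embedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small sizes, chosen only for illustration.
n_seq, n_embd, n_vocab = 5, 8, 11

# Random stand-ins (assumptions): final hidden states and token embeddings.
x = rng.standard_normal((n_seq, n_embd))
wte = rng.standard_normal((n_vocab, n_embd))

# Current form: project every position to vocabulary logits.
full_logits = x @ wte.T      # [n_seq, n_embd] -> [n_seq, n_vocab]

# Proposed form: project only the last position.
last_logits = x[-1] @ wte.T  # [n_embd] -> [n_vocab]

# Greedy next-token choice under each form.
next_id_full = int(np.argmax(full_logits[-1]))
next_id_last = int(np.argmax(last_logits))

print(full_logits.shape, last_logits.shape)
print(next_id_full == next_id_last)
```

Both forms select the same token under greedy decoding, and the last-position form cuts the final matmul from roughly n_seq * n_embd * n_vocab multiply-adds to n_embd * n_vocab. The trade-off is that callers needing logits at every position, for example to compute a training loss or to score a whole prompt, would no longer get them from this function.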
