Phi-Mamba is a subquadratic, Mamba-based model distilled from Phi-1.5 using the MOHAWK method with only 3B tokens. MOHAWK allows cross-architectural distillation from Transformers by viewing Attention ...
self.A = np.random.randn(state_dim, state_dim) * 0.01
self.B = np.random.randn(state_dim, embed_dim) * 0.1
self.C = np.random.randn(vocab_size, state_dim) * 0.1
...
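The shapes of A, B, and C in the fragment above are consistent with a plain linear state-space recurrence, h_t = A h_{t-1} + B x_t with logits_t = C h_t. The sketch below is one possible reading under that assumption, not the original code: the class name `TinySSM`, the `forward` method, the zero initial state, and the seeded generator are all hypothetical. It is also not the selective Mamba layer used in Phi-Mamba, only a toy recurrence with the same three matrices.

```python
import numpy as np


class TinySSM:
    """Toy linear state-space model (illustrative; names are assumptions)."""

    def __init__(self, vocab_size, embed_dim, state_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Same shapes and scales as the fragment: small A keeps the state stable.
        self.A = rng.standard_normal((state_dim, state_dim)) * 0.01
        self.B = rng.standard_normal((state_dim, embed_dim)) * 0.1
        self.C = rng.standard_normal((vocab_size, state_dim)) * 0.1

    def forward(self, x):
        """Map embeddings x of shape (seq_len, embed_dim), seq_len >= 1,
        to logits of shape (seq_len, vocab_size)."""
        h = np.zeros(self.A.shape[0])  # assumed zero initial state
        logits = []
        for x_t in x:
            # State update, then linear readout to vocabulary logits.
            h = self.A @ h + self.B @ x_t
            logits.append(self.C @ h)
        return np.stack(logits)


model = TinySSM(vocab_size=11, embed_dim=4, state_dim=3)
x = np.ones((5, 4))
out = model.forward(x)
print(out.shape)
```

Each step costs a fixed amount of work regardless of how many tokens came before, so the whole sequence is processed in time linear in its length, which is the sense in which such models are called subquadratic.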