first commit, support Chinese and min_freq filtering

Jeff Zhang 2015-07-07 12:09:34 +08:00
commit c874651412
14 changed files with 41232 additions and 0 deletions

123
Readme.md Normal file

@@ -0,0 +1,123 @@
# char-rnn-chinese
Based on https://github.com/karpathy/char-rnn, adapted to work well with Chinese. The code can process both English and Chinese characters.
This is my first time writing Lua, so the string processing may look clumsy, but it works well.
I also added an option called `min_freq`, because the Chinese character vocabulary is very large, which greatly increases the number of parameters.
Dropping rare characters keeps the model smaller.
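For example, to keep only characters that appear at least 5 times (`data/my_chinese` is a hypothetical folder name):
```
$ th train.lua -data_dir data/my_chinese -min_freq 5
```
Characters below the threshold are mapped to a single `UNKNOW` token (see `util/CharSplitLMMinibatchLoader.lua`).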
-----------------------------------------------
Karpathy's Readme
This code implements **multi-layer Recurrent Neural Network** (RNN, LSTM, and GRU) for training/sampling from character-level language models. The model learns to predict the probability of the next character in a sequence. In other words, the input is a single text file and the model learns to generate text like it.
The context of this code base is described in detail in my [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). The [project page](http://cs.stanford.edu/people/karpathy/char-rnn/) has a few pointers to some datasets.
This code was originally based on Oxford University Machine Learning class [practical 6](https://github.com/oxford-cs-ml-2015/practical6), which is in turn based on [learning to execute](https://github.com/wojciechz/learning_to_execute) code from Wojciech Zaremba. Chunks of it were also developed in collaboration with my labmate [Justin Johnson](https://github.com/jcjohnson/).
## Requirements
This code is written in Lua and requires [Torch](http://torch.ch/).
Additionally, you need to install the `nngraph` and `optim` packages using [LuaRocks](https://luarocks.org/) which you will be able to do after installing Torch:
```bash
$ luarocks install nngraph
$ luarocks install optim
```
If you'd like to use CUDA GPU computing, you'll first need to install the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit), then the `cutorch` and `cunn` packages:
```bash
$ luarocks install cutorch
$ luarocks install cunn
```
If you'd like to use OpenCL GPU computing, you'll first need to install the `cltorch` and `clnn` packages, and then use the option `-opencl 1` during training:
```bash
$ luarocks install cltorch
$ luarocks install clnn
```
## Usage
### Data
All input data is stored inside the `data/` directory. You'll notice that there is an example dataset included in the repo (in folder `data/tinyshakespeare`) which consists of a subset of works of Shakespeare. I'm providing a few more datasets on the [project page](http://cs.stanford.edu/people/karpathy/char-rnn/).
**Your own data**: If you'd like to use your own data, create a single file `input.txt` and place it into a folder in `data/`. For example, `data/some_folder/input.txt`. The first time you run the training script it will write two more convenience files (`vocab.t7` and `data.t7`) into `data/some_folder`.
Note that if your data is too small (1MB is already considered very small) the RNN won't learn very effectively. Remember that it has to learn everything completely from scratch.
Conversely if your data is large (more than about 2MB), feel confident to increase `rnn_size` and train a bigger model (see details of training below). It will work *significantly better*. For example with 6MB you can easily go up to `rnn_size` 300 or even more. The biggest that fits on my GPU and that I've trained with this code is `rnn_size` 700 with `num_layers` 3 (2 is default).
### Training
Start training the model using `train.lua`, for example:
```
$ th train.lua -data_dir data/some_folder -gpuid -1
```
The `-data_dir` flag is most important since it specifies the dataset to use. Notice that in this example we're also setting `gpuid` to -1 which tells the code to train using CPU, otherwise it defaults to GPU 0. There are many other flags for various options. Consult `$ th train.lua -help` for comprehensive settings. Here's another example:
```
$ th train.lua -data_dir data/some_folder -rnn_size 512 -num_layers 2 -dropout 0.5
```
While the model is training it will periodically write checkpoint files to the `cv` folder. How often these checkpoints are written is controlled by the `eval_val_every` option, measured in iterations (e.g. if this is 1 then a checkpoint is written every iteration). The filename of these checkpoints contains a very important number: the **loss**. For example, a checkpoint with filename `lm_lstm_epoch0.95_2.0681.t7` indicates that at this point the model was on epoch 0.95 (i.e. it has almost done one full pass over the training data), and the loss on validation data was 2.0681. This number is very important because the lower it is, the better the checkpoint works. Once you start to generate data (discussed below), you will want to use the model checkpoint that has the lowest validation loss. Notice that this might not necessarily be the last checkpoint at the end of training (due to possible overfitting).
Other important quantities to be aware of are `batch_size` (call it B), `seq_length` (call it S), and the `train_frac` and `val_frac` settings. The batch size specifies how many streams of data are processed in parallel at one time. The sequence length specifies the length of each chunk, which is also the limit at which the gradients get clipped. For example, if `seq_length` is 20, then the gradient signal will never backpropagate more than 20 time steps, and the model might not *find* dependencies longer than this length in number of characters. If your input text file has N characters, these first all get split into chunks of size BxS. These chunks then get allocated to the three splits train/val/test according to the `frac` settings. If your data is small, it's possible that with the default settings you'll only have very few chunks in total (for example 100). This is bad: in these cases you may want to decrease batch size or sequence length.
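As a concrete example: with the defaults `batch_size` 50 and `seq_length` 50, each chunk covers 50 x 50 = 2,500 characters, so a 1MB input (roughly 1 million characters) yields about 1,000,000 / 2,500 = 400 chunks before the train/val/test split.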
You can also init parameters from a previously saved checkpoint using `init_from`.
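For example, to continue training from the checkpoint named above:
```
$ th train.lua -data_dir data/some_folder -init_from cv/lm_lstm_epoch0.95_2.0681.t7
```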
We can use these checkpoints to generate text (discussed next).
### Sampling
Given a checkpoint file (such as those written to `cv`) we can generate new text. For example:
```
$ th sample.lua cv/some_checkpoint.t7 -gpuid -1
```
Make sure that if your checkpoint was trained with GPU it is also sampled from with GPU, or vice versa. Otherwise the code will (currently) complain. As with the train script, see `$ th sample.lua -help` for full options. One important one is (for example) `-length 10000` which would generate 10,000 characters (default = 2000).
**Temperature**. An important parameter you may want to play with a lot is `-temperature`, which takes a number in the range (0, 1\] (notice 0 is not included), default = 1. The predicted log probabilities are divided by the temperature before the Softmax, so lower temperatures make the model more confident but also more boring and conservative, while higher temperatures make it take more chances, increasing the diversity of results at the cost of more mistakes.
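As a minimal Torch sketch of the effect (an illustration written for this Readme, not code from the repo; the real logic lives in `sample.lua`):
```lua
require 'torch'
-- log probabilities over a toy 3-character vocabulary
local logprobs = torch.Tensor{math.log(0.7), math.log(0.2), math.log(0.1)}
local temperature = 0.5          -- try 1.0 vs 0.5
local probs = torch.exp(logprobs:div(temperature))
probs:div(torch.sum(probs))      -- renormalize so the probabilities sum to one
print(probs)                     -- at 0.5 the top character's probability grows from 0.7 to ~0.91
```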
**Priming**. It's also possible to prime the model with some starting text using `-primetext`. This starts out the RNN with some hardcoded characters to *warm* it up with some context before it starts generating text.
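For example, priming with a short phrase:
```
$ th sample.lua cv/some_checkpoint.t7 -primetext "the quick brown fox" -length 500
```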
Happy sampling!
## Tips and Tricks
### Monitoring Validation Loss vs. Training Loss
If you're somewhat new to Machine Learning or Neural Networks it can take a bit of expertise to get good models. The most important quantity to keep track of is the difference between your training loss (printed during training) and the validation loss (printed each time the RNN is run on the validation data, by default every 1000 iterations). In particular:
- If your training loss is much lower than validation loss then this means the network might be **overfitting**. Solutions to this are to decrease your network size, or to increase dropout. For example you could try dropout of 0.5 and so on.
- If your training/validation loss are about equal then your model is **underfitting**. Increase the size of your model (either the number of layers or the raw number of neurons per layer).
### Approximate number of parameters
The two most important parameters that control the model are `rnn_size` and `num_layers`. I would advise that you always use a `num_layers` of either 2 or 3. The `rnn_size` can be adjusted based on how much data you have. The two important quantities to keep track of here are:
- The number of parameters in your model. This is printed when you start training.
- The size of your dataset. 1MB file is approximately 1 million characters.
These two should be about the same order of magnitude. It's a little tricky to tell. Here are some examples (a rough way to estimate the parameter count yourself is sketched after them):
- I have a 100MB dataset and I'm using the default parameter settings (which currently print 150K parameters). My data size is significantly larger (100 mil >> 0.15 mil), so I expect to heavily underfit. I am thinking I can comfortably afford to make `rnn_size` larger.
- I have a 10MB dataset and running a 10 million parameter model. I'm slightly nervous and I'm carefully monitoring my validation loss. If it's larger than my training loss then I may want to increase dropout a bit.
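If you want a rough estimate of the parameter count before training starts, here is a sketch based on the `nn.Linear` layers in `model/LSTM.lua` (an approximation written for this Readme, not code from the repo):
```lua
-- per layer: i2h = Linear(input_size_L, 4*rnn_size) and h2h = Linear(rnn_size, 4*rnn_size),
-- each with a bias vector; plus the final decoder Linear(rnn_size, vocab_size)
local function approx_params(vocab_size, rnn_size, num_layers)
  local n = 0
  for L = 1, num_layers do
    local input_size_L = (L == 1) and vocab_size or rnn_size
    n = n + 4 * rnn_size * (input_size_L + rnn_size + 2)
  end
  return n + rnn_size * vocab_size + vocab_size
end
print(approx_params(65, 128, 2)) -- compare with the number train.lua prints
```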
### Best models strategy
The winning strategy to obtaining very good models (if you have the compute time) is to always err on making the network larger (as large as you're willing to wait for it to compute) and then try different dropout values (between 0 and 1). Whatever model has the best validation performance (the loss, written in the checkpoint filename, low is good) is the one you should use in the end.
It is very common in deep learning to run many different models with many different hyperparameter settings, and in the end take whatever checkpoint gave the best validation performance.
By the way, the size of your training and validation splits are also parameters. Make sure you have a decent amount of data in your validation set or otherwise the validation performance will be noisy and not very informative.
## License
MIT

Binary file not shown.

40000
data/tinyshakespeare/input.txt Normal file

File diff suppressed because it is too large

Binary file not shown.

35
inspect_checkpoint.lua Normal file

@@ -0,0 +1,35 @@
-- simple script that loads a checkpoint and prints its opts
require 'torch'
require 'nn'
require 'nngraph'
require 'util.OneHot'
require 'util.misc'
cmd = torch.CmdLine()
cmd:text()
cmd:text('Load a checkpoint and print its options and validation losses.')
cmd:text()
cmd:text('Options')
cmd:argument('-model','model to load')
cmd:option('-gpuid',0,'gpu to use')
cmd:text()
-- parse input params
opt = cmd:parse(arg)
if opt.gpuid >= 0 then
print('using CUDA on GPU ' .. opt.gpuid .. '...')
require 'cutorch'
require 'cunn'
cutorch.setDevice(opt.gpuid + 1)
end
local model = torch.load(opt.model)
print('opt:')
print(model.opt)
print('val losses:')
print(model.val_losses)
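-- example (using the checkpoint name from the Readme):
--   th inspect_checkpoint.lua cv/lm_lstm_epoch0.95_2.0681.t7 -gpuid -1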

52
model/GRU.lua Normal file

@@ -0,0 +1,52 @@
local GRU = {}
--[[
Creates one timestep of one GRU
Paper reference: http://arxiv.org/pdf/1412.3555v1.pdf
]]--
function GRU.gru(input_size, rnn_size, n)
-- there are n+1 inputs (hiddens on each layer and x)
local inputs = {}
table.insert(inputs, nn.Identity()()) -- x
for L = 1,n do
table.insert(inputs, nn.Identity()()) -- prev_h[L]
end
function new_input_sum(insize, xv, hv)
local i2h = nn.Linear(insize, rnn_size)(xv)
local h2h = nn.Linear(rnn_size, rnn_size)(hv)
return nn.CAddTable()({i2h, h2h})
end
local x, input_size_L
local outputs = {}
for L = 1,n do
local prev_h = inputs[L+1]
if L == 1 then x = inputs[1] else x = outputs[L-1] end
if L == 1 then input_size_L = input_size else input_size_L = rnn_size end
-- GRU tick
-- forward the update and reset gates
local update_gate = nn.Sigmoid()(new_input_sum(input_size_L, x, prev_h))
local reset_gate = nn.Sigmoid()(new_input_sum(input_size_L, x, prev_h))
-- compute candidate hidden state
local gated_hidden = nn.CMulTable()({reset_gate, prev_h})
local p2 = nn.Linear(rnn_size, rnn_size)(gated_hidden)
local p1 = nn.Linear(input_size_L, rnn_size)(x)
local hidden_candidate = nn.Tanh()(nn.CAddTable()({p1,p2}))
-- compute new interpolated hidden state, based on the update gate
local zh = nn.CMulTable()({update_gate, hidden_candidate})
local zhm1 = nn.CMulTable()({nn.AddConstant(1,false)(nn.MulConstant(-1,false)(update_gate)), prev_h})
local next_h = nn.CAddTable()({zh, zhm1})
table.insert(outputs, next_h)
end
return nn.gModule(inputs, outputs)
end
return GRU
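-- usage sketch (not part of the original file): GRU.gru(input_size, rnn_size, n)
-- builds a single timestep as an nngraph module. Note that, unlike model/LSTM.lua,
-- x here feeds straight into nn.Linear, so the input must already be a
-- one-hot (or otherwise input_size-dimensional) vector.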

65
model/LSTM.lua Normal file

@@ -0,0 +1,65 @@
local LSTM = {}
function LSTM.lstm(input_size, rnn_size, n, dropout)
dropout = dropout or 0
-- there will be 2*n+1 inputs
local inputs = {}
table.insert(inputs, nn.Identity()()) -- x
for L = 1,n do
table.insert(inputs, nn.Identity()()) -- prev_c[L]
table.insert(inputs, nn.Identity()()) -- prev_h[L]
end
local x, input_size_L
local outputs = {}
for L = 1,n do
-- c,h from previos timesteps
local prev_h = inputs[L*2+1]
local prev_c = inputs[L*2]
-- the input to this layer
if L == 1 then
x = OneHot(input_size)(inputs[1])
input_size_L = input_size
else
x = outputs[(L-1)*2]
if dropout > 0 then x = nn.Dropout(dropout)(x) end -- apply dropout, if any
input_size_L = rnn_size
end
-- evaluate the input sums at once for efficiency
local i2h = nn.Linear(input_size_L, 4 * rnn_size)(x)
local h2h = nn.Linear(rnn_size, 4 * rnn_size)(prev_h)
local all_input_sums = nn.CAddTable()({i2h, h2h})
-- decode the gates
local sigmoid_chunk = nn.Narrow(2, 1, 3 * rnn_size)(all_input_sums)
sigmoid_chunk = nn.Sigmoid()(sigmoid_chunk)
local in_gate = nn.Narrow(2, 1, rnn_size)(sigmoid_chunk)
local forget_gate = nn.Narrow(2, rnn_size + 1, rnn_size)(sigmoid_chunk)
local out_gate = nn.Narrow(2, 2 * rnn_size + 1, rnn_size)(sigmoid_chunk)
-- decode the write inputs
local in_transform = nn.Narrow(2, 3 * rnn_size + 1, rnn_size)(all_input_sums)
in_transform = nn.Tanh()(in_transform)
-- perform the LSTM update
local next_c = nn.CAddTable()({
nn.CMulTable()({forget_gate, prev_c}),
nn.CMulTable()({in_gate, in_transform})
})
-- gated cells form the output
local next_h = nn.CMulTable()({out_gate, nn.Tanh()(next_c)})
table.insert(outputs, next_c)
table.insert(outputs, next_h)
end
-- set up the decoder
local top_h = outputs[#outputs]
if dropout > 0 then top_h = nn.Dropout(dropout)(top_h) end
local proj = nn.Linear(rnn_size, input_size)(top_h)
local logsoft = nn.LogSoftMax()(proj)
table.insert(outputs, logsoft)
return nn.gModule(inputs, outputs)
end
return LSTM

31
model/RNN.lua Normal file

@@ -0,0 +1,31 @@
local RNN = {}
function RNN.rnn(input_size, rnn_size, n)
-- there are n+1 inputs (hiddens on each layer and x)
local inputs = {}
table.insert(inputs, nn.Identity()()) -- x
for L = 1,n do
table.insert(inputs, nn.Identity()()) -- prev_h[L]
end
local x, input_size_L
local outputs = {}
for L = 1,n do
local prev_h = inputs[L+1]
if L == 1 then x = inputs[1] else x = outputs[L-1] end
if L == 1 then input_size_L = input_size else input_size_L = rnn_size end
-- RNN tick
local i2h = nn.Linear(input_size_L, rnn_size)(x)
local h2h = nn.Linear(rnn_size, rnn_size)(prev_h)
local next_h = nn.Tanh()(nn.CAddTable(){i2h, h2h})
table.insert(outputs, next_h)
end
return nn.gModule(inputs, outputs)
end
return RNN

163
sample.lua Normal file

@@ -0,0 +1,163 @@
--[[
This file samples characters from a trained model
Code is based on implementation in
https://github.com/oxford-cs-ml-2015/practical6
]]--
require 'torch'
require 'nn'
require 'nngraph'
require 'optim'
require 'lfs'
require 'util.OneHot'
require 'util.misc'
cmd = torch.CmdLine()
cmd:text()
cmd:text('Sample from a character-level language model')
cmd:text()
cmd:text('Options')
-- required:
cmd:argument('-model','model checkpoint to use for sampling')
-- optional parameters
cmd:option('-seed',123,'random number generator\'s seed')
cmd:option('-sample',1,' 0 to use max at each timestep, 1 to sample at each timestep')
cmd:option('-primetext',"",'used as a prompt to "seed" the state of the LSTM using a given sequence, before we sample.')
cmd:option('-length',2000,'number of characters to sample')
cmd:option('-temperature',1,'temperature of sampling')
cmd:option('-gpuid',0,'which gpu to use. -1 = use CPU')
cmd:option('-verbose',1,'set to 0 to ONLY print the sampled text, no diagnostics')
cmd:text()
-- parse input params
opt = cmd:parse(arg)
-- gated print: simple utility function wrapping a print
function gprint(str)
if opt.verbose == 1 then print(str) end
end
-- check that cunn/cutorch are installed if user wants to use the GPU
if opt.gpuid >= 0 then
local ok, cunn = pcall(require, 'cunn')
local ok2, cutorch = pcall(require, 'cutorch')
if not ok then gprint('package cunn not found!') end
if not ok2 then gprint('package cutorch not found!') end
if ok and ok2 then
gprint('using CUDA on GPU ' .. opt.gpuid .. '...')
cutorch.setDevice(opt.gpuid + 1) -- note +1 to make it 0 indexed! sigh lua
cutorch.manualSeed(opt.seed)
else
gprint('Falling back on CPU mode')
opt.gpuid = -1 -- overwrite user setting
end
end
torch.manualSeed(opt.seed)
-- load the model checkpoint
if not lfs.attributes(opt.model, 'mode') then
gprint('Error: File ' .. opt.model .. ' does not exist. Are you sure you didn\'t forget to prepend cv/ ?')
end
checkpoint = torch.load(opt.model)
protos = checkpoint.protos
protos.rnn:evaluate() -- put in eval mode so that dropout works properly
-- initialize the vocabulary (and its inverted version)
local vocab = checkpoint.vocab
local ivocab = {}
for c,i in pairs(vocab) do ivocab[i] = c end
-- initialize the rnn state to all zeros
gprint('creating an LSTM...')
local current_state
local num_layers = checkpoint.opt.num_layers
current_state = {}
for L = 1,checkpoint.opt.num_layers do
-- c and h for all layers
local h_init = torch.zeros(1, checkpoint.opt.rnn_size)
if opt.gpuid >= 0 then h_init = h_init:cuda() end
table.insert(current_state, h_init:clone())
table.insert(current_state, h_init:clone())
end
state_size = #current_state
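-- split a UTF-8 string into a table of characters; the byte-length trick is
-- the same as in util/CharSplitLMMinibatchLoader.lua, where it is commented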
function get_char(str)
local len = #str
local left = 0
local arr = {0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc}
local unordered = {}
local start = 1
local wordLen = 0
while len ~= left do
local tmp = string.byte(str, start)
local i = #arr
while arr[i] do
if tmp >= arr[i] then
break
end
i = i - 1
end
wordLen = i + wordLen
local tmpString = string.sub(str, start, wordLen)
start = start + i
left = left + i
unordered[#unordered+1] = tmpString
end
return unordered
end
-- do a few seeded timesteps
local seed_text = opt.primetext
if string.len(seed_text) > 0 then
gprint('seeding with ' .. seed_text)
gprint('--------------------------')
local chars = get_char(seed_text)
print(chars)
for i,c in ipairs(chars) do
prev_char = torch.Tensor{vocab[c] or vocab['UNKNOW']} -- fall back for characters filtered out by min_freq
io.write(ivocab[prev_char[1]])
if opt.gpuid >= 0 then prev_char = prev_char:cuda() end
local lst = protos.rnn:forward{prev_char, unpack(current_state)}
-- lst is a list of [state1,state2,..stateN,output]. We want everything but last piece
current_state = {}
for i=1,state_size do table.insert(current_state, lst[i]) end
prediction = lst[#lst] -- last element holds the log probabilities
end
else
-- fill with uniform probabilities over characters (? hmm)
gprint('missing seed text, using uniform probability over first character')
gprint('--------------------------')
prediction = torch.Tensor(1, #ivocab):fill(1)/(#ivocab)
if opt.gpuid >= 0 then prediction = prediction:cuda() end
end
-- start sampling/argmaxing
for i=1, opt.length do
-- log probabilities from the previous timestep
if opt.sample == 0 then
-- use argmax
local _, prev_char_ = prediction:max(2)
prev_char = prev_char_:resize(1)
else
-- use sampling
prediction:div(opt.temperature) -- scale by temperature
local probs = torch.exp(prediction):squeeze()
probs:div(torch.sum(probs)) -- renormalize so probs sum to one
prev_char = torch.multinomial(probs:float(), 1):resize(1):float()
end
-- forward the rnn for next character
local lst = protos.rnn:forward{prev_char, unpack(current_state)}
current_state = {}
for i=1,state_size do table.insert(current_state, lst[i]) end
prediction = lst[#lst] -- last element holds the log probabilities
io.write(ivocab[prev_char[1]])
end
io.write('\n') io.flush()

333
train.lua Normal file

@@ -0,0 +1,333 @@
--[[
This file trains a character-level multi-layer RNN on text data
Code is based on implementation in
https://github.com/oxford-cs-ml-2015/practical6
but modified to have multi-layer support, GPU support, as well as
many other common model/optimization bells and whistles.
The practical6 code is in turn based on
https://github.com/wojciechz/learning_to_execute
which is turn based on other stuff in Torch, etc... (long lineage)
]]--
require 'torch'
require 'nn'
require 'nngraph'
require 'optim'
require 'lfs'
require 'util.OneHot'
require 'util.misc'
local CharSplitLMMinibatchLoader = require 'util.CharSplitLMMinibatchLoader'
local model_utils = require 'util.model_utils'
local LSTM = require 'model.LSTM'
cmd = torch.CmdLine()
cmd:text()
cmd:text('Train a character-level language model')
cmd:text()
cmd:text('Options')
-- data
cmd:option('-data_dir','data/tinyshakespeare','data directory. Should contain the file input.txt with input data')
-- model params
cmd:option('-rnn_size', 128, 'size of LSTM internal state')
cmd:option('-num_layers', 2, 'number of layers in the LSTM')
cmd:option('-model', 'lstm', 'for now only lstm is supported. keep fixed')
-- optimization
cmd:option('-learning_rate',2e-3,'learning rate')
cmd:option('-learning_rate_decay',0.97,'learning rate decay')
cmd:option('-learning_rate_decay_after',10,'in number of epochs, when to start decaying the learning rate')
cmd:option('-decay_rate',0.95,'decay rate for rmsprop')
cmd:option('-dropout',0,'dropout for regularization, used after each RNN hidden layer. 0 = no dropout')
cmd:option('-seq_length',50,'number of timesteps to unroll for')
cmd:option('-batch_size',50,'number of sequences to train on in parallel')
cmd:option('-max_epochs',50,'number of full passes through the training data')
cmd:option('-grad_clip',5,'clip gradients at this value')
cmd:option('-train_frac',0.95,'fraction of data that goes into train set')
cmd:option('-val_frac',0.05,'fraction of data that goes into validation set')
-- test_frac will be computed as (1 - train_frac - val_frac)
cmd:option('-init_from', '', 'initialize network parameters from checkpoint at this path')
-- bookkeeping
cmd:option('-seed',123,'torch manual random number generator seed')
cmd:option('-print_every',1,'how many steps/minibatches between printing out the loss')
cmd:option('-eval_val_every',1000,'every how many iterations should we evaluate on validation data?')
cmd:option('-checkpoint_dir', 'cv', 'output directory where checkpoints get written')
cmd:option('-savefile','lstm','filename to autosave the checkpont to. Will be inside checkpoint_dir/')
-- GPU/CPU
cmd:option('-gpuid',0,'which gpu to use. -1 = use CPU')
cmd:option('-opencl',0,'use OpenCL (instead of CUDA)')
cmd:option('-min_freq',0,'minimum character frequency; rarer characters are replaced by the UNKNOW token')
cmd:text()
-- parse input params
opt = cmd:parse(arg)
torch.manualSeed(opt.seed)
-- train / val / test split for data, in fractions
local test_frac = math.max(0, 1 - (opt.train_frac + opt.val_frac))
local split_sizes = {opt.train_frac, opt.val_frac, test_frac}
-- initialize cunn/cutorch for training on the GPU and fall back to CPU gracefully
if opt.gpuid >= 0 and opt.opencl == 0 then
local ok, cunn = pcall(require, 'cunn')
local ok2, cutorch = pcall(require, 'cutorch')
if not ok then print('package cunn not found!') end
if not ok2 then print('package cutorch not found!') end
if ok and ok2 then
print('using CUDA on GPU ' .. opt.gpuid .. '...')
cutorch.setDevice(opt.gpuid + 1) -- note +1 to make it 0 indexed! sigh lua
cutorch.manualSeed(opt.seed)
else
print('If cutorch and cunn are installed, your CUDA toolkit may be improperly configured.')
print('Check your CUDA toolkit installation, rebuild cutorch and cunn, and try again.')
print('Falling back on CPU mode')
opt.gpuid = -1 -- overwrite user setting
end
end
-- initialize clnn/cltorch for training on the GPU and fall back to CPU gracefully
if opt.gpuid >= 0 and opt.opencl == 1 then
local ok, clnn = pcall(require, 'clnn')
local ok2, cltorch = pcall(require, 'cltorch')
if not ok then print('package clnn not found!') end
if not ok2 then print('package cltorch not found!') end
if ok and ok2 then
print('using OpenCL on GPU ' .. opt.gpuid .. '...')
cltorch.setDevice(opt.gpuid + 1) -- note +1 to make it 0 indexed! sigh lua
torch.manualSeed(opt.seed)
else
print('If cltorch and clnn are installed, your OpenCL driver may be improperly configured.')
print('Check your OpenCL driver installation, check output of clinfo command, and try again.')
print('Falling back on CPU mode')
opt.gpuid = -1 -- overwrite user setting
end
end
-- create the data loader class
local loader = CharSplitLMMinibatchLoader.create(opt.data_dir, opt.batch_size, opt.seq_length, split_sizes, opt.min_freq)
local vocab_size = loader.vocab_size -- the number of distinct characters
local vocab = loader.vocab_mapping
print('vocab size: ' .. vocab_size)
-- make sure output directory exists
if not path.exists(opt.checkpoint_dir) then lfs.mkdir(opt.checkpoint_dir) end
-- define the model: prototypes for one timestep, then clone them in time
local do_random_init = true
if string.len(opt.init_from) > 0 then
print('loading an LSTM from checkpoint ' .. opt.init_from)
local checkpoint = torch.load(opt.init_from)
protos = checkpoint.protos
-- make sure the vocabs are the same
local vocab_compatible = true
for c,i in pairs(checkpoint.vocab) do
if vocab[c] ~= i then
vocab_compatible = false
end
end
assert(vocab_compatible, 'error, the character vocabulary for this dataset and the one in the saved checkpoint are not the same. This is trouble.')
-- overwrite model settings based on checkpoint to ensure compatibility
print('overwriting rnn_size=' .. checkpoint.opt.rnn_size .. ', num_layers=' .. checkpoint.opt.num_layers .. ' based on the checkpoint.')
opt.rnn_size = checkpoint.opt.rnn_size
opt.num_layers = checkpoint.opt.num_layers
do_random_init = false
else
print('creating an LSTM with ' .. opt.num_layers .. ' layers')
protos = {}
protos.rnn = LSTM.lstm(vocab_size, opt.rnn_size, opt.num_layers, opt.dropout)
protos.criterion = nn.ClassNLLCriterion()
end
-- the initial state of the cell/hidden states
init_state = {}
for L=1,opt.num_layers do
local h_init = torch.zeros(opt.batch_size, opt.rnn_size)
if opt.gpuid >=0 and opt.opencl == 0 then h_init = h_init:cuda() end
if opt.gpuid >=0 and opt.opencl == 1 then h_init = h_init:cl() end
table.insert(init_state, h_init:clone())
table.insert(init_state, h_init:clone())
end
-- ship the model to the GPU if desired
if opt.gpuid >= 0 and opt.opencl == 0 then
for k,v in pairs(protos) do v:cuda() end
end
if opt.gpuid >= 0 and opt.opencl == 1 then
for k,v in pairs(protos) do v:cl() end
end
-- put the above things into one flattened parameters tensor
params, grad_params = model_utils.combine_all_parameters(protos.rnn)
-- initialization
if do_random_init then
params:uniform(-0.08, 0.08) -- small numbers uniform
end
print('number of parameters in the model: ' .. params:nElement())
-- make a bunch of clones after flattening, as that reallocates memory
clones = {}
for name,proto in pairs(protos) do
print('cloning ' .. name)
clones[name] = model_utils.clone_many_times(proto, opt.seq_length, not proto.parameters)
end
-- evaluate the loss over an entire split
function eval_split(split_index, max_batches)
print('evaluating loss over split index ' .. split_index)
local n = loader.split_sizes[split_index]
if max_batches ~= nil then n = math.min(max_batches, n) end
loader:reset_batch_pointer(split_index) -- move batch iteration pointer for this split to front
local loss = 0
local rnn_state = {[0] = init_state}
for i = 1,n do -- iterate over batches in the split
-- fetch a batch
local x, y = loader:next_batch(split_index)
if opt.gpuid >= 0 and opt.opencl == 0 then -- ship the input arrays to GPU
-- have to convert to float because integers can't be cuda()'d
x = x:float():cuda()
y = y:float():cuda()
end
if opt.gpuid >= 0 and opt.opencl == 1 then -- ship the input arrays to GPU
x = x:cl()
y = y:cl()
end
-- forward pass
for t=1,opt.seq_length do
clones.rnn[t]:evaluate() -- for dropout proper functioning
local lst = clones.rnn[t]:forward{x[{{}, t}], unpack(rnn_state[t-1])}
rnn_state[t] = {}
for i=1,#init_state do table.insert(rnn_state[t], lst[i]) end
prediction = lst[#lst]
loss = loss + clones.criterion[t]:forward(prediction, y[{{}, t}])
end
-- carry over lstm state
rnn_state[0] = rnn_state[#rnn_state]
print(i .. '/' .. n .. '...')
end
loss = loss / opt.seq_length / n
return loss
end
-- do fwd/bwd and return loss, grad_params
local init_state_global = clone_list(init_state)
function feval(x)
if x ~= params then
params:copy(x)
end
grad_params:zero()
------------------ get minibatch -------------------
local x, y = loader:next_batch(1)
if opt.gpuid >= 0 and opt.opencl == 0 then -- ship the input arrays to GPU
-- have to convert to float because integers can't be cuda()'d
x = x:float():cuda()
y = y:float():cuda()
end
if opt.gpuid >= 0 and opt.opencl == 1 then -- ship the input arrays to GPU
x = x:cl()
y = y:cl()
end
------------------- forward pass -------------------
local rnn_state = {[0] = init_state_global}
local predictions = {} -- softmax outputs
local loss = 0
for t=1,opt.seq_length do
clones.rnn[t]:training() -- make sure we are in correct mode (this is cheap, sets flag)
local lst = clones.rnn[t]:forward{x[{{}, t}], unpack(rnn_state[t-1])}
rnn_state[t] = {}
for i=1,#init_state do table.insert(rnn_state[t], lst[i]) end -- extract the state, without output
predictions[t] = lst[#lst] -- last element is the prediction
loss = loss + clones.criterion[t]:forward(predictions[t], y[{{}, t}])
end
loss = loss / opt.seq_length
------------------ backward pass -------------------
-- initialize gradient at time t to be zeros (there's no influence from future)
local drnn_state = {[opt.seq_length] = clone_list(init_state, true)} -- true also zeros the clones
for t=opt.seq_length,1,-1 do
-- backprop through loss, and softmax/linear
local doutput_t = clones.criterion[t]:backward(predictions[t], y[{{}, t}])
table.insert(drnn_state[t], doutput_t)
local dlst = clones.rnn[t]:backward({x[{{}, t}], unpack(rnn_state[t-1])}, drnn_state[t])
drnn_state[t-1] = {}
for k,v in pairs(dlst) do
if k > 1 then -- k == 1 is gradient on x, which we dont need
-- note we do k-1 because first item is dembeddings, and then follow the
-- derivatives of the state, starting at index 2. I know...
drnn_state[t-1][k-1] = v
end
end
end
------------------------ misc ----------------------
-- transfer final state to initial state (BPTT)
init_state_global = rnn_state[#rnn_state] -- NOTE: I don't think this needs to be a clone, right?
-- clip gradient element-wise
grad_params:clamp(-opt.grad_clip, opt.grad_clip)
return loss, grad_params
end
-- start optimization here
train_losses = {}
val_losses = {}
local optim_state = {learningRate = opt.learning_rate, alpha = opt.decay_rate}
local iterations = opt.max_epochs * loader.ntrain
local iterations_per_epoch = loader.ntrain
local loss0 = nil
for i = 1, iterations do
local epoch = i / loader.ntrain
local timer = torch.Timer()
local _, loss = optim.rmsprop(feval, params, optim_state)
local time = timer:time().real
local train_loss = loss[1] -- the loss is inside a list, pop it
train_losses[i] = train_loss
-- exponential learning rate decay
if i % loader.ntrain == 0 and opt.learning_rate_decay < 1 then
if epoch >= opt.learning_rate_decay_after then
local decay_factor = opt.learning_rate_decay
optim_state.learningRate = optim_state.learningRate * decay_factor -- decay it
print('decayed learning rate by a factor ' .. decay_factor .. ' to ' .. optim_state.learningRate)
end
end
-- every now and then or on last iteration
if i % opt.eval_val_every == 0 or i == iterations then
-- evaluate loss on validation data
local val_loss = eval_split(2) -- 2 = validation
val_losses[i] = val_loss
local savefile = string.format('%s/lm_%s_epoch%.2f_%.4f.t7', opt.checkpoint_dir, opt.savefile, epoch, val_loss)
print('saving checkpoint to ' .. savefile)
local checkpoint = {}
checkpoint.protos = protos
checkpoint.opt = opt
checkpoint.train_losses = train_losses
checkpoint.val_loss = val_loss
checkpoint.val_losses = val_losses
checkpoint.i = i
checkpoint.epoch = epoch
checkpoint.vocab = loader.vocab_mapping
torch.save(savefile, checkpoint)
end
if i % opt.print_every == 0 then
print(string.format("%d/%d (epoch %.3f), train_loss = %6.8f, grad/param norm = %6.4e, time/batch = %.2fs", i, iterations, epoch, train_loss, grad_params:norm() / params:norm(), time))
end
if i % 10 == 0 then collectgarbage() end
-- handle early stopping if things are going really bad
if loss0 == nil then loss0 = loss[1] end
if loss[1] > loss0 * 3 then
print('loss is exploding, aborting.')
break -- halt
end
end

235
util/CharSplitLMMinibatchLoader.lua Normal file

@@ -0,0 +1,235 @@
-- Modified from https://github.com/oxford-cs-ml-2015/practical6
-- the modification included support for train/val/test splits
local CharSplitLMMinibatchLoader = {}
CharSplitLMMinibatchLoader.__index = CharSplitLMMinibatchLoader
function CharSplitLMMinibatchLoader.create(data_dir, batch_size, seq_length, split_fractions, min_freq)
-- split_fractions is e.g. {0.9, 0.05, 0.05}
local self = {}
setmetatable(self, CharSplitLMMinibatchLoader)
local input_file = path.join(data_dir, 'input.txt')
local vocab_file = path.join(data_dir, 'vocab.t7')
local tensor_file = path.join(data_dir, 'data.t7')
-- fetch file attributes to determine if we need to rerun preprocessing
local run_prepro = false
if not (path.exists(vocab_file) and path.exists(tensor_file)) then
-- prepro files do not exist, generate them
print('vocab.t7 and data.t7 do not exist. Running preprocessing...')
run_prepro = true
else
-- check if the input file was modified since last time we
-- ran the prepro. if so, we have to rerun the preprocessing
local input_attr = lfs.attributes(input_file)
local vocab_attr = lfs.attributes(vocab_file)
local tensor_attr = lfs.attributes(tensor_file)
if input_attr.modification > vocab_attr.modification or input_attr.modification > tensor_attr.modification then
print('vocab.t7 or data.t7 detected as stale. Re-running preprocessing...')
run_prepro = true
end
end
if run_prepro then
-- construct a tensor with all the data, and vocab file
print('one-time setup: preprocessing input text file ' .. input_file .. '...')
CharSplitLMMinibatchLoader.text_to_tensor(input_file, vocab_file, tensor_file, min_freq)
end
print('loading data files...')
local data = torch.load(tensor_file)
self.vocab_mapping = torch.load(vocab_file)
-- cut off the end so that it divides evenly
local len = data:size(1)
if len % (batch_size * seq_length) ~= 0 then
print('cutting off end of data so that the batches/sequences divide evenly')
data = data:sub(1, batch_size * seq_length
* math.floor(len / (batch_size * seq_length)))
end
-- count vocab
self.vocab_size = 0
for _ in pairs(self.vocab_mapping) do
self.vocab_size = self.vocab_size + 1
end
-- self.batches is a table of tensors
print('reshaping tensor...')
self.batch_size = batch_size
self.seq_length = seq_length
local ydata = data:clone()
ydata:sub(1,-2):copy(data:sub(2,-1))
ydata[-1] = data[1]
self.x_batches = data:view(batch_size, -1):split(seq_length, 2) -- #rows = #batches
self.nbatches = #self.x_batches
self.y_batches = ydata:view(batch_size, -1):split(seq_length, 2) -- #rows = #batches
assert(#self.x_batches == #self.y_batches)
-- lets try to be helpful here
if self.nbatches < 50 then
print('WARNING: less than 50 batches in the data in total? Looks like very small dataset. You probably want to use smaller batch_size and/or seq_length.')
end
-- perform safety checks on split_fractions
assert(split_fractions[1] >= 0 and split_fractions[1] <= 1, 'bad split fraction ' .. split_fractions[1] .. ' for train, not between 0 and 1')
assert(split_fractions[2] >= 0 and split_fractions[2] <= 1, 'bad split fraction ' .. split_fractions[2] .. ' for val, not between 0 and 1')
assert(split_fractions[3] >= 0 and split_fractions[3] <= 1, 'bad split fraction ' .. split_fractions[3] .. ' for test, not between 0 and 1')
if split_fractions[3] == 0 then
-- catch a common special case where the user might not want a test set
self.ntrain = math.floor(self.nbatches * split_fractions[1])
self.nval = self.nbatches - self.ntrain
self.ntest = 0
else
-- divide data to train/val and allocate rest to test
self.ntrain = math.floor(self.nbatches * split_fractions[1])
self.nval = math.floor(self.nbatches * split_fractions[2])
self.ntest = self.nbatches - self.nval - self.ntrain -- the rest goes to test (to ensure this adds up exactly)
end
self.split_sizes = {self.ntrain, self.nval, self.ntest}
self.batch_ix = {0,0,0}
print(string.format('data load done. Number of data batches in train: %d, val: %d, test: %d', self.ntrain, self.nval, self.ntest))
collectgarbage()
return self
end
function CharSplitLMMinibatchLoader:reset_batch_pointer(split_index, batch_index)
batch_index = batch_index or 0
self.batch_ix[split_index] = batch_index
end
function CharSplitLMMinibatchLoader:next_batch(split_index)
if self.split_sizes[split_index] == 0 then
-- perform a check here to make sure the user isn't screwing something up
local split_names = {'train', 'val', 'test'}
print('ERROR. Code requested a batch for split ' .. split_names[split_index] .. ', but this split has no data.')
os.exit() -- crash violently
end
-- split_index is integer: 1 = train, 2 = val, 3 = test
self.batch_ix[split_index] = self.batch_ix[split_index] + 1
if self.batch_ix[split_index] > self.split_sizes[split_index] then
self.batch_ix[split_index] = 1 -- cycle around to beginning
end
-- pull out the correct next batch
local ix = self.batch_ix[split_index]
if split_index == 2 then ix = ix + self.ntrain end -- offset by train set size
if split_index == 3 then ix = ix + self.ntrain + self.nval end -- offset by train + val
return self.x_batches[ix], self.y_batches[ix]
end
-- chinese vocab
function get_vocab(str, min_freq)
local len = #str
local left = 0
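-- UTF-8 lead-byte thresholds: bytes below 0xc0 are single ASCII characters,
-- 0xc0-0xdf start 2-byte sequences, 0xe0-0xef 3-byte sequences, and so on;
-- the index i found below is therefore the byte length of the current character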
local arr = {0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc}
local unordered = {}
local start = 1
local wordLen = 0
g_total_chars = 0
while len ~= left do
local tmp = string.byte(str, start)
local i = #arr
while arr[i] do
if tmp >= arr[i] then
break
end
i = i - 1
end
wordLen = i + wordLen
local tmpString = string.sub(str, start, wordLen)
start = start + i
left = left + i
if not unordered[tmpString] then
unordered[tmpString] = 1
else
unordered[tmpString] = unordered[tmpString] + 1
end
g_total_chars = g_total_chars + 1
end
final_res = {}
for char_val, char_cnt in pairs(unordered) do
if char_cnt >= min_freq then
final_res[char_val] = true
end
end
return final_res
end
-- change raw data to tokens
function get_data(str, vocab_mapping)
-- cannot use torch.ByteTensor because it cannot store more than 256 distinct values
-- local data = torch.ByteTensor(g_total_chars) -- store it into 1D first, then rearrange
local data = torch.ShortTensor(g_total_chars)
local len = #str
local left = 0
local arr = {0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc}
local start = 1
local wordLen = 0
local count = 1
while len ~= left do
local tmp = string.byte(str, start)
local i = #arr
while arr[i] do
if tmp >= arr[i] then
break
end
i = i - 1
end
wordLen = i + wordLen
local tmpString = string.sub(str, start, wordLen)
start = start + i
left = left + i
if vocab_mapping[tmpString] then
data[count] = vocab_mapping[tmpString]
else
data[count] = vocab_mapping['UNKNOW']
end
count = count + 1
end
return data
end
-- *** STATIC method ***
function CharSplitLMMinibatchLoader.text_to_tensor(in_textfile, out_vocabfile, out_tensorfile, min_freq)
local timer = torch.Timer()
print('loading text file...')
local f = torch.DiskFile(in_textfile)
local rawdata = f:readString('*a') -- NOTE: this reads the whole file at once
f:close()
-- create vocabulary if it doesn't exist yet
print('creating vocabulary mapping...')
-- record all characters to a set
local unordered = get_vocab(rawdata, min_freq)
-- sort into a table (i.e. keys become 1..N)
local ordered = {}
for char in pairs(unordered) do ordered[#ordered + 1] = char end
table.sort(ordered)
-- invert `ordered` to create the char->int mapping
local vocab_mapping = {}
count_vocab = 0
for i, char in ipairs(ordered) do
vocab_mapping[char] = i
count_vocab = count_vocab + 1
end
vocab_mapping['UNKNOW'] = count_vocab + 1
-- construct a tensor with all the data
print('putting data into tensor, this can take a while...')
local data = get_data(rawdata, vocab_mapping)
-- save output preprocessed files
print('saving ' .. out_vocabfile)
torch.save(out_vocabfile, vocab_mapping)
print('saving ' .. out_tensorfile)
torch.save(out_tensorfile, data)
end
return CharSplitLMMinibatchLoader

20
util/OneHot.lua Normal file

@@ -0,0 +1,20 @@
local OneHot, parent = torch.class('OneHot', 'nn.Module')
function OneHot:__init(outputSize)
parent.__init(self)
self.outputSize = outputSize
-- We'll construct one-hot encodings by using the index method to
-- reshuffle the rows of an identity matrix. To avoid recreating
-- it every iteration we'll cache it.
self._eye = torch.eye(outputSize)
end
function OneHot:updateOutput(input)
self.output:resize(input:size(1), self.outputSize):zero()
if self._eye == nil then self._eye = torch.eye(self.outputSize) end
self._eye = self._eye:float()
local longInput = input:long()
self.output:copy(self._eye:index(1, longInput))
return self.output
end
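-- usage sketch (not part of the original file):
--   local oh = OneHot(5)
--   local out = oh:forward(torch.Tensor{2, 4, 1}) -- 3x5 matrix with a single 1 per row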

13
util/misc.lua Normal file

@@ -0,0 +1,13 @@
-- misc utilities
function clone_list(tensor_list, zero_too)
-- utility function. todo: move away to some utils file?
-- takes a list of tensors and returns a list of cloned tensors
local out = {}
for k,v in pairs(tensor_list) do
out[k] = v:clone()
if zero_too then out[k]:zero() end
end
return out
end

162
util/model_utils.lua Normal file

@@ -0,0 +1,162 @@
-- adapted from https://github.com/wojciechz/learning_to_execute
-- utilities for combining/flattening parameters in a model
-- the code in this script is more general than it needs to be, which is
-- why it is kind of large
require 'torch'
local model_utils = {}
function model_utils.combine_all_parameters(...)
--[[ like module:getParameters, but operates on many modules ]]--
-- get parameters
local networks = {...}
local parameters = {}
local gradParameters = {}
for i = 1, #networks do
local net_params, net_grads = networks[i]:parameters()
if net_params then
for _, p in pairs(net_params) do
parameters[#parameters + 1] = p
end
for _, g in pairs(net_grads) do
gradParameters[#gradParameters + 1] = g
end
end
end
local function storageInSet(set, storage)
local storageAndOffset = set[torch.pointer(storage)]
if storageAndOffset == nil then
return nil
end
local _, offset = unpack(storageAndOffset)
return offset
end
-- this function flattens arbitrary lists of parameters,
-- even complex shared ones
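-- the trick: fill a fresh flat tensor with 1s, point each parameter tensor
-- into it, and zero the slots each one actually occupies; the remaining 1s
-- are 'holes' left by sharing, and their cumulative sum tells how far to
-- shift each parameter when compacting into the final flat tensor below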
local function flatten(parameters)
if not parameters or #parameters == 0 then
return torch.Tensor()
end
local Tensor = parameters[1].new
local storages = {}
local nParameters = 0
for k = 1,#parameters do
local storage = parameters[k]:storage()
if not storageInSet(storages, storage) then
storages[torch.pointer(storage)] = {storage, nParameters}
nParameters = nParameters + storage:size()
end
end
local flatParameters = Tensor(nParameters):fill(1)
local flatStorage = flatParameters:storage()
for k = 1,#parameters do
local storageOffset = storageInSet(storages, parameters[k]:storage())
parameters[k]:set(flatStorage,
storageOffset + parameters[k]:storageOffset(),
parameters[k]:size(),
parameters[k]:stride())
parameters[k]:zero()
end
local maskParameters= flatParameters:float():clone()
local cumSumOfHoles = flatParameters:float():cumsum(1)
local nUsedParameters = nParameters - cumSumOfHoles[#cumSumOfHoles]
local flatUsedParameters = Tensor(nUsedParameters)
local flatUsedStorage = flatUsedParameters:storage()
for k = 1,#parameters do
local offset = cumSumOfHoles[parameters[k]:storageOffset()]
parameters[k]:set(flatUsedStorage,
parameters[k]:storageOffset() - offset,
parameters[k]:size(),
parameters[k]:stride())
end
for _, storageAndOffset in pairs(storages) do
local k, v = unpack(storageAndOffset)
flatParameters[{{v+1,v+k:size()}}]:copy(Tensor():set(k))
end
if cumSumOfHoles:sum() == 0 then
flatUsedParameters:copy(flatParameters)
else
local counter = 0
for k = 1,flatParameters:nElement() do
if maskParameters[k] == 0 then
counter = counter + 1
flatUsedParameters[counter] = flatParameters[counter+cumSumOfHoles[k]]
end
end
assert (counter == nUsedParameters)
end
return flatUsedParameters
end
-- flatten parameters and gradients
local flatParameters = flatten(parameters)
local flatGradParameters = flatten(gradParameters)
-- return new flat vector that contains all discrete parameters
return flatParameters, flatGradParameters
end
function model_utils.clone_many_times(net, T)
local clones = {}
local params, gradParams
if net.parameters then
params, gradParams = net:parameters()
if params == nil then
params = {}
end
end
local paramsNoGrad
if net.parametersNoGrad then
paramsNoGrad = net:parametersNoGrad()
end
local mem = torch.MemoryFile("w"):binary()
mem:writeObject(net)
for t = 1, T do
-- We need to use a new reader for each clone.
-- We don't want to use the pointers to already read objects.
local reader = torch.MemoryFile(mem:storage(), "r"):binary()
local clone = reader:readObject()
reader:close()
if net.parameters then
local cloneParams, cloneGradParams = clone:parameters()
local cloneParamsNoGrad
for i = 1, #params do
cloneParams[i]:set(params[i])
cloneGradParams[i]:set(gradParams[i])
end
if paramsNoGrad then
cloneParamsNoGrad = clone:parametersNoGrad()
for i =1,#paramsNoGrad do
cloneParamsNoGrad[i]:set(paramsNoGrad[i])
end
end
end
clones[t] = clone
collectgarbage()
end
mem:close()
return clones
end
return model_utils