first commit, support Chinese and min_freq filtering

Jeff Zhang 2015-07-07 12:09:34 +08:00
commit c874651412
14 changed files with 41232 additions and 0 deletions

123
Readme.md Normal file

@@ -0,0 +1,123 @@
# char-rnn-chinese
Based on https://github.com/karpathy/char-rnn, adapted to work well with Chinese. The code can process both English and Chinese characters.
This is my first time writing Lua, so the string processing may look clumsy, but it works well.
I also added an option called `min_freq`, because the Chinese character vocabulary is very large, which greatly increases the number of parameters.
Dropping rare characters keeps the model smaller.
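For example, to keep only characters that appear at least 5 times (`data/my_chinese` is a hypothetical folder name):
```
$ th train.lua -data_dir data/my_chinese -min_freq 5
```
Characters below the threshold are mapped to a single `UNKNOW` token (see `util/CharSplitLMMinibatchLoader.lua`).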
-----------------------------------------------
Karpathy's Readme
This code implements **multi-layer Recurrent Neural Network** (RNN, LSTM, and GRU) for training/sampling from character-level language models. The model learns to predict the probability of the next character in a sequence. In other words, the input is a single text file and the model learns to generate text like it.
The context of this code base is described in detail in my [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). The [project page](http://cs.stanford.edu/people/karpathy/char-rnn/) has a few pointers to some datasets.
This code was originally based on Oxford University Machine Learning class [practical 6](https://github.com/oxford-cs-ml-2015/practical6), which is in turn based on [learning to execute](https://github.com/wojciechz/learning_to_execute) code from Wojciech Zaremba. Chunks of it were also developed in collaboration with my labmate [Justin Johnson](https://github.com/jcjohnson/).
## Requirements
This code is written in Lua and requires [Torch](http://torch.ch/).
Additionally, you need to install the `nngraph` and `optim` packages using [LuaRocks](https://luarocks.org/) which you will be able to do after installing Torch:
```bash
$ luarocks install nngraph
$ luarocks install optim
```
If you'd like to use CUDA GPU computing, you'll first need to install the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit), then the `cutorch` and `cunn` packages:
```bash
$ luarocks install cutorch
$ luarocks install cunn
```
If you'd like to use OpenCL GPU computing, you'll first need to install the `cltorch` and `clnn` packages, and then use the option `-opencl 1` during training:
```bash
$ luarocks install cltorch
$ luarocks install clnn
```
## Usage
### Data
All input data is stored inside the `data/` directory. You'll notice that there is an example dataset included in the repo (in folder `data/tinyshakespeare`) which consists of a subset of works of Shakespeare. I'm providing a few more datasets on the [project page](http://cs.stanford.edu/people/karpathy/char-rnn/).
**Your own data**: If you'd like to use your own data, create a single file `input.txt` and place it into a folder in `data/`. For example, `data/some_folder/input.txt`. The first time you run the training script it will write two more convenience files (`vocab.t7` and `data.t7`) into `data/some_folder`.
Note that if your data is too small (1MB is already considered very small) the RNN won't learn very effectively. Remember that it has to learn everything completely from scratch.
Conversely if your data is large (more than about 2MB), feel confident to increase `rnn_size` and train a bigger model (see details of training below). It will work *significantly better*. For example with 6MB you can easily go up to `rnn_size` 300 or even more. The biggest that fits on my GPU and that I've trained with this code is `rnn_size` 700 with `num_layers` 3 (2 is default).
### Training
Start training the model using `train.lua`, for example:
```
$ th train.lua -data_dir data/some_folder -gpuid -1
```
The `-data_dir` flag is most important since it specifies the dataset to use. Notice that in this example we're also setting `gpuid` to -1 which tells the code to train using CPU, otherwise it defaults to GPU 0. There are many other flags for various options. Consult `$ th train.lua -help` for comprehensive settings. Here's another example:
```
$ th train.lua -data_dir data/some_folder -rnn_size 512 -num_layers 2 -dropout 0.5
```
While the model is training it will periodically write checkpoint files to the `cv` folder. How often these checkpoints are written is controlled by the `eval_val_every` option, measured in iterations (e.g. if this is 1 then a checkpoint is written every iteration). The filename of these checkpoints contains a very important number: the **loss**. For example, a checkpoint with filename `lm_lstm_epoch0.95_2.0681.t7` indicates that at this point the model was on epoch 0.95 (i.e. it has almost done one full pass over the training data), and the loss on validation data was 2.0681. This number is very important because the lower it is, the better the checkpoint works. Once you start to generate data (discussed below), you will want to use the model checkpoint that has the lowest validation loss. Notice that this might not necessarily be the last checkpoint at the end of training (due to possible overfitting).
Other important quantities to be aware of are `batch_size` (call it B), `seq_length` (call it S), and the `train_frac` and `val_frac` settings. The batch size specifies how many streams of data are processed in parallel at one time. The sequence length specifies the length of each chunk, which is also the limit at which the gradients get clipped. For example, if `seq_length` is 20, then the gradient signal will never backpropagate more than 20 time steps, and the model might not *find* dependencies longer than this length in number of characters. If your input text file has N characters, these first all get split into chunks of size BxS. These chunks then get allocated to the three splits train/val/test according to the `frac` settings. If your data is small, it's possible that with the default settings you'll only have very few chunks in total (for example 100). This is bad: in these cases you may want to decrease batch size or sequence length.
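As a concrete example: with the defaults `batch_size` 50 and `seq_length` 50, each chunk covers 50 x 50 = 2,500 characters, so a 1MB input (roughly 1 million characters) yields about 1,000,000 / 2,500 = 400 chunks before the train/val/test split.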
You can also init parameters from a previously saved checkpoint using `init_from`.
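For example, to continue training from the checkpoint named above:
```
$ th train.lua -data_dir data/some_folder -init_from cv/lm_lstm_epoch0.95_2.0681.t7
```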
We can use these checkpoints to generate text (discussed next).
### Sampling
Given a checkpoint file (such as those written to `cv`) we can generate new text. For example:
```
$ th sample.lua cv/some_checkpoint.t7 -gpuid -1
```
Make sure that if your checkpoint was trained with GPU it is also sampled from with GPU, or vice versa. Otherwise the code will (currently) complain. As with the train script, see `$ th sample.lua -help` for full options. One important one is (for example) `-length 10000` which would generate 10,000 characters (default = 2000).
**Temperature**. An important parameter you may want to play with a lot is `-temperature`, which takes a number in the range (0, 1\] (notice 0 is not included), default = 1. The predicted log probabilities are divided by the temperature before the Softmax, so lower temperatures make the model more confident but also more boring and conservative, while higher temperatures make it take more chances, increasing the diversity of results at the cost of more mistakes.
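As a minimal Torch sketch of the effect (an illustration written for this Readme, not code from the repo; the real logic lives in `sample.lua`):
```lua
require 'torch'
-- log probabilities over a toy 3-character vocabulary
local logprobs = torch.Tensor{math.log(0.7), math.log(0.2), math.log(0.1)}
local temperature = 0.5          -- try 1.0 vs 0.5
local probs = torch.exp(logprobs:div(temperature))
probs:div(torch.sum(probs))      -- renormalize so the probabilities sum to one
print(probs)                     -- at 0.5 the top character's probability grows from 0.7 to ~0.91
```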
**Priming**. It's also possible to prime the model with some starting text using `-primetext`. This starts out the RNN with some hardcoded characters to *warm* it up with some context before it starts generating text.
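For example, priming with a short phrase:
```
$ th sample.lua cv/some_checkpoint.t7 -primetext "the quick brown fox" -length 500
```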
Happy sampling!
## Tips and Tricks
### Monitoring Validation Loss vs. Training Loss
If you're somewhat new to Machine Learning or Neural Networks it can take a bit of expertise to get good models. The most important quantity to keep track of is the difference between your training loss (printed during training) and the validation loss (printed each time the RNN is run on the validation data, by default every 1000 iterations). In particular:
- If your training loss is much lower than validation loss then this means the network might be **overfitting**. Solutions to this are to decrease your network size, or to increase dropout. For example you could try dropout of 0.5 and so on.
- If your training/validation loss are about equal then your model is **underfitting**. Increase the size of your model (either the number of layers or the raw number of neurons per layer).
### Approximate number of parameters
The two most important parameters that control the model are `rnn_size` and `num_layers`. I would advise that you always use a `num_layers` of either 2 or 3. The `rnn_size` can be adjusted based on how much data you have. The two important quantities to keep track of here are:
- The number of parameters in your model. This is printed when you start training.
- The size of your dataset. 1MB file is approximately 1 million characters.
These two should be about the same order of magnitude. It's a little tricky to tell. Here are some examples (a rough way to estimate the parameter count yourself is sketched after them):
- I have a 100MB dataset and I'm using the default parameter settings (which currently print 150K parameters). My data size is significantly larger (100 mil >> 0.15 mil), so I expect to heavily underfit. I am thinking I can comfortably afford to make `rnn_size` larger.
- I have a 10MB dataset and running a 10 million parameter model. I'm slightly nervous and I'm carefully monitoring my validation loss. If it's larger than my training loss then I may want to increase dropout a bit.
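If you want a rough estimate of the parameter count before training starts, here is a sketch based on the `nn.Linear` layers in `model/LSTM.lua` (an approximation written for this Readme, not code from the repo):
```lua
-- per layer: i2h = Linear(input_size_L, 4*rnn_size) and h2h = Linear(rnn_size, 4*rnn_size),
-- each with a bias vector; plus the final decoder Linear(rnn_size, vocab_size)
local function approx_params(vocab_size, rnn_size, num_layers)
  local n = 0
  for L = 1, num_layers do
    local input_size_L = (L == 1) and vocab_size or rnn_size
    n = n + 4 * rnn_size * (input_size_L + rnn_size + 2)
  end
  return n + rnn_size * vocab_size + vocab_size
end
print(approx_params(65, 128, 2)) -- compare with the number train.lua prints
```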
### Best models strategy
The winning strategy to obtaining very good models (if you have the compute time) is to always err on making the network larger (as large as you're willing to wait for it to compute) and then try different dropout values (between 0 and 1). Whatever model has the best validation performance (the loss, written in the checkpoint filename, low is good) is the one you should use in the end.
It is very common in deep learning to run many different models with many different hyperparameter settings, and in the end take whatever checkpoint gave the best validation performance.
By the way, the size of your training and validation splits are also parameters. Make sure you have a decent amount of data in your validation set or otherwise the validation performance will be noisy and not very informative.
## License
MIT

Binary file not shown.

40000
data/tinyshakespeare/input.txt Normal file

File diff suppressed because it is too large

Binary file not shown.

35
inspect_checkpoint.lua Normal file

@@ -0,0 +1,35 @@
-- simple script that loads a checkpoint and prints its opts
require 'torch'
require 'nn'
require 'nngraph'
require 'util.OneHot'
require 'util.misc'
cmd = torch.CmdLine()
cmd:text()
cmd:text('Load a checkpoint and print its options and validation losses.')
cmd:text()
cmd:text('Options')
cmd:argument('-model','model to load')
cmd:option('-gpuid',0,'gpu to use')
cmd:text()
-- parse input params
opt = cmd:parse(arg)
if opt.gpuid >= 0 then
print('using CUDA on GPU ' .. opt.gpuid .. '...')
require 'cutorch'
require 'cunn'
cutorch.setDevice(opt.gpuid + 1)
end
local model = torch.load(opt.model)
print('opt:')
print(model.opt)
print('val losses:')
print(model.val_losses)
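-- example (using the checkpoint name from the Readme):
--   th inspect_checkpoint.lua cv/lm_lstm_epoch0.95_2.0681.t7 -gpuid -1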

52
model/GRU.lua Normal file

@@ -0,0 +1,52 @@
local GRU = {}
--[[
Creates one timestep of one GRU
Paper reference: http://arxiv.org/pdf/1412.3555v1.pdf
]]--
function GRU.gru(input_size, rnn_size, n)
-- there are n+1 inputs (hiddens on each layer and x)
local inputs = {}
table.insert(inputs, nn.Identity()()) -- x
for L = 1,n do
table.insert(inputs, nn.Identity()()) -- prev_h[L]
end
function new_input_sum(insize, xv, hv)
local i2h = nn.Linear(insize, rnn_size)(xv)
local h2h = nn.Linear(rnn_size, rnn_size)(hv)
return nn.CAddTable()({i2h, h2h})
end
local x, input_size_L
local outputs = {}
for L = 1,n do
local prev_h = inputs[L+1]
if L == 1 then x = inputs[1] else x = outputs[L-1] end
if L == 1 then input_size_L = input_size else input_size_L = rnn_size end
-- GRU tick
-- forward the update and reset gates
local update_gate = nn.Sigmoid()(new_input_sum(input_size_L, x, prev_h))
local reset_gate = nn.Sigmoid()(new_input_sum(input_size_L, x, prev_h))
-- compute candidate hidden state
local gated_hidden = nn.CMulTable()({reset_gate, prev_h})
local p2 = nn.Linear(rnn_size, rnn_size)(gated_hidden)
local p1 = nn.Linear(input_size_L, rnn_size)(x)
local hidden_candidate = nn.Tanh()(nn.CAddTable()({p1,p2}))
-- compute new interpolated hidden state, based on the update gate
local zh = nn.CMulTable()({update_gate, hidden_candidate})
local zhm1 = nn.CMulTable()({nn.AddConstant(1,false)(nn.MulConstant(-1,false)(update_gate)), prev_h})
local next_h = nn.CAddTable()({zh, zhm1})
table.insert(outputs, next_h)
end
return nn.gModule(inputs, outputs)
end
return GRU
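-- usage sketch (not part of the original file): GRU.gru(input_size, rnn_size, n)
-- builds a single timestep as an nngraph module. Note that, unlike model/LSTM.lua,
-- x here feeds straight into nn.Linear, so the input must already be a
-- one-hot (or otherwise input_size-dimensional) vector.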

65
model/LSTM.lua Normal file

@@ -0,0 +1,65 @@
local LSTM = {}
function LSTM.lstm(input_size, rnn_size, n, dropout)
dropout = dropout or 0
-- there will be 2*n+1 inputs
local inputs = {}
table.insert(inputs, nn.Identity()()) -- x
for L = 1,n do
table.insert(inputs, nn.Identity()()) -- prev_c[L]
table.insert(inputs, nn.Identity()()) -- prev_h[L]
end
local x, input_size_L
local outputs = {}
for L = 1,n do
-- c,h from previos timesteps
local prev_h = inputs[L*2+1]
local prev_c = inputs[L*2]
-- the input to this layer
if L == 1 then
x = OneHot(input_size)(inputs[1])
input_size_L = input_size
else
x = outputs[(L-1)*2]
if dropout > 0 then x = nn.Dropout(dropout)(x) end -- apply dropout, if any
input_size_L = rnn_size
end
-- evaluate the input sums at once for efficiency
local i2h = nn.Linear(input_size_L, 4 * rnn_size)(x)
local h2h = nn.Linear(rnn_size, 4 * rnn_size)(prev_h)
local all_input_sums = nn.CAddTable()({i2h, h2h})
-- decode the gates
local sigmoid_chunk = nn.Narrow(2, 1, 3 * rnn_size)(all_input_sums)
sigmoid_chunk = nn.Sigmoid()(sigmoid_chunk)
local in_gate = nn.Narrow(2, 1, rnn_size)(sigmoid_chunk)
local forget_gate = nn.Narrow(2, rnn_size + 1, rnn_size)(sigmoid_chunk)
local out_gate = nn.Narrow(2, 2 * rnn_size + 1, rnn_size)(sigmoid_chunk)
-- decode the write inputs
local in_transform = nn.Narrow(2, 3 * rnn_size + 1, rnn_size)(all_input_sums)
in_transform = nn.Tanh()(in_transform)
-- perform the LSTM update
local next_c = nn.CAddTable()({
nn.CMulTable()({forget_gate, prev_c}),
nn.CMulTable()({in_gate, in_transform})
})
-- gated cells form the output
local next_h = nn.CMulTable()({out_gate, nn.Tanh()(next_c)})
table.insert(outputs, next_c)
table.insert(outputs, next_h)
end
-- set up the decoder
local top_h = outputs[#outputs]
if dropout > 0 then top_h = nn.Dropout(dropout)(top_h) end
local proj = nn.Linear(rnn_size, input_size)(top_h)
local logsoft = nn.LogSoftMax()(proj)
table.insert(outputs, logsoft)
return nn.gModule(inputs, outputs)
end
return LSTM

31
model/RNN.lua Normal file

@@ -0,0 +1,31 @@
local RNN = {}
function RNN.rnn(input_size, rnn_size, n)
-- there are n+1 inputs (hiddens on each layer and x)
local inputs = {}
table.insert(inputs, nn.Identity()()) -- x
for L = 1,n do
table.insert(inputs, nn.Identity()()) -- prev_h[L]
end
local x, input_size_L
local outputs = {}
for L = 1,n do
local prev_h = inputs[L+1]
if L == 1 then x = inputs[1] else x = outputs[L-1] end
if L == 1 then input_size_L = input_size else input_size_L = rnn_size end
-- RNN tick
local i2h = nn.Linear(input_size_L, rnn_size)(x)
local h2h = nn.Linear(rnn_size, rnn_size)(prev_h)
local next_h = nn.Tanh()(nn.CAddTable(){i2h, h2h})
table.insert(outputs, next_h)
end
return nn.gModule(inputs, outputs)
end
return RNN

163
sample.lua Normal file

@@ -0,0 +1,163 @@
--[[
This file samples characters from a trained model
Code is based on implementation in
https://github.com/oxford-cs-ml-2015/practical6
]]--
require 'torch'
require 'nn'
require 'nngraph'
require 'optim'
require 'lfs'
require 'util.OneHot'
require 'util.misc'
cmd = torch.CmdLine()
cmd:text()
cmd:text('Sample from a character-level language model')
cmd:text()
cmd:text('Options')
-- required:
cmd:argument('-model','model checkpoint to use for sampling')
-- optional parameters
cmd:option('-seed',123,'random number generator\'s seed')
cmd:option('-sample',1,' 0 to use max at each timestep, 1 to sample at each timestep')
cmd:option('-primetext',"",'used as a prompt to "seed" the state of the LSTM using a given sequence, before we sample.')
cmd:option('-length',2000,'number of characters to sample')
cmd:option('-temperature',1,'temperature of sampling')
cmd:option('-gpuid',0,'which gpu to use. -1 = use CPU')
cmd:option('-verbose',1,'set to 0 to ONLY print the sampled text, no diagnostics')
cmd:text()
-- parse input params
opt = cmd:parse(arg)
-- gated print: simple utility function wrapping a print
function gprint(str)
if opt.verbose == 1 then print(str) end
end
-- check that cunn/cutorch are installed if user wants to use the GPU
if opt.gpuid >= 0 then
local ok, cunn = pcall(require, 'cunn')
local ok2, cutorch = pcall(require, 'cutorch')
if not ok then gprint('package cunn not found!') end
if not ok2 then gprint('package cutorch not found!') end
if ok and ok2 then
gprint('using CUDA on GPU ' .. opt.gpuid .. '...')
cutorch.setDevice(opt.gpuid + 1) -- note +1 to make it 0 indexed! sigh lua
cutorch.manualSeed(opt.seed)
else
gprint('Falling back on CPU mode')
opt.gpuid = -1 -- overwrite user setting
end
end
torch.manualSeed(opt.seed)
-- load the model checkpoint
if not lfs.attributes(opt.model, 'mode') then
gprint('Error: File ' .. opt.model .. ' does not exist. Are you sure you didn\'t forget to prepend cv/ ?')
end
checkpoint = torch.load(opt.model)
protos = checkpoint.protos
protos.rnn:evaluate() -- put in eval mode so that dropout works properly
-- initialize the vocabulary (and its inverted version)
local vocab = checkpoint.vocab
local ivocab = {}
for c,i in pairs(vocab) do ivocab[i] = c end
-- initialize the rnn state to all zeros
gprint('creating an LSTM...')
local current_state
local num_layers = checkpoint.opt.num_layers
current_state = {}
for L = 1,checkpoint.opt.num_layers do
-- c and h for all layers
local h_init = torch.zeros(1, checkpoint.opt.rnn_size)
if opt.gpuid >= 0 then h_init = h_init:cuda() end
table.insert(current_state, h_init:clone())
table.insert(current_state, h_init:clone())
end
state_size = #current_state
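-- split a UTF-8 string into a table of characters; the byte-length trick is
-- the same as in util/CharSplitLMMinibatchLoader.lua, where it is commented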
function get_char(str)
local len = #str
local left = 0
local arr = {0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc}
local unordered = {}
local start = 1
local wordLen = 0
while len ~= left do
local tmp = string.byte(str, start)
local i = #arr
while arr[i] do
if tmp >= arr[i] then
break
end
i = i - 1
end
wordLen = i + wordLen
local tmpString = string.sub(str, start, wordLen)
start = start + i
left = left + i
unordered[#unordered+1] = tmpString
end
return unordered
end
-- do a few seeded timesteps
local seed_text = opt.primetext
if string.len(seed_text) > 0 then
gprint('seeding with ' .. seed_text)
gprint('--------------------------')
local chars = get_char(seed_text)
print(chars)
for i,c in ipairs(chars) do
prev_char = torch.Tensor{vocab[c] or vocab['UNKNOW']} -- fall back for characters filtered out by min_freq
io.write(ivocab[prev_char[1]])
if opt.gpuid >= 0 then prev_char = prev_char:cuda() end
local lst = protos.rnn:forward{prev_char, unpack(current_state)}
-- lst is a list of [state1,state2,..stateN,output]. We want everything but last piece
current_state = {}
for i=1,state_size do table.insert(current_state, lst[i]) end
prediction = lst[#lst] -- last element holds the log probabilities
end
else
-- fill with uniform probabilities over characters (? hmm)
gprint('missing seed text, using uniform probability over first character')
gprint('--------------------------')
prediction = torch.Tensor(1, #ivocab):fill(1)/(#ivocab)
if opt.gpuid >= 0 then prediction = prediction:cuda() end
end
-- start sampling/argmaxing
for i=1, opt.length do
-- log probabilities from the previous timestep
if opt.sample == 0 then
-- use argmax
local _, prev_char_ = prediction:max(2)
prev_char = prev_char_:resize(1)
else
-- use sampling
prediction:div(opt.temperature) -- scale by temperature
local probs = torch.exp(prediction):squeeze()
probs:div(torch.sum(probs)) -- renormalize so probs sum to one
prev_char = torch.multinomial(probs:float(), 1):resize(1):float()
end
-- forward the rnn for next character
local lst = protos.rnn:forward{prev_char, unpack(current_state)}
current_state = {}
for i=1,state_size do table.insert(current_state, lst[i]) end
prediction = lst[#lst] -- last element holds the log probabilities
io.write(ivocab[prev_char[1]])
end
io.write('\n') io.flush()

333
train.lua Normal file

@@ -0,0 +1,333 @@
--[[
This file trains a character-level multi-layer RNN on text data
Code is based on implementation in
https://github.com/oxford-cs-ml-2015/practical6
but modified to have multi-layer support, GPU support, as well as
many other common model/optimization bells and whistles.
The practical6 code is in turn based on
https://github.com/wojciechz/learning_to_execute
which is turn based on other stuff in Torch, etc... (long lineage)
]]--
require 'torch'
require 'nn'
require 'nngraph'
require 'optim'
require 'lfs'
require 'util.OneHot'
require 'util.misc'
local CharSplitLMMinibatchLoader = require 'util.CharSplitLMMinibatchLoader'
local model_utils = require 'util.model_utils'
local LSTM = require 'model.LSTM'
cmd = torch.CmdLine()
cmd:text()
cmd:text('Train a character-level language model')
cmd:text()
cmd:text('Options')
-- data
cmd:option('-data_dir','data/tinyshakespeare','data directory. Should contain the file input.txt with input data')
-- model params
cmd:option('-rnn_size', 128, 'size of LSTM internal state')
cmd:option('-num_layers', 2, 'number of layers in the LSTM')
cmd:option('-model', 'lstm', 'for now only lstm is supported. keep fixed')
-- optimization
cmd:option('-learning_rate',2e-3,'learning rate')
cmd:option('-learning_rate_decay',0.97,'learning rate decay')
cmd:option('-learning_rate_decay_after',10,'in number of epochs, when to start decaying the learning rate')
cmd:option('-decay_rate',0.95,'decay rate for rmsprop')
cmd:option('-dropout',0,'dropout for regularization, used after each RNN hidden layer. 0 = no dropout')
cmd:option('-seq_length',50,'number of timesteps to unroll for')
cmd:option('-batch_size',50,'number of sequences to train on in parallel')
cmd:option('-max_epochs',50,'number of full passes through the training data')
cmd:option('-grad_clip',5,'clip gradients at this value')
cmd:option('-train_frac',0.95,'fraction of data that goes into train set')
cmd:option('-val_frac',0.05,'fraction of data that goes into validation set')
-- test_frac will be computed as (1 - train_frac - val_frac)
cmd:option('-init_from', '', 'initialize network parameters from checkpoint at this path')
-- bookkeeping
cmd:option('-seed',123,'torch manual random number generator seed')
cmd:option('-print_every',1,'how many steps/minibatches between printing out the loss')
cmd:option('-eval_val_every',1000,'every how many iterations should we evaluate on validation data?')
cmd:option('-checkpoint_dir', 'cv', 'output directory where checkpoints get written')
cmd:option('-savefile','lstm','filename to autosave the checkpont to. Will be inside checkpoint_dir/')
-- GPU/CPU
cmd:option('-gpuid',0,'which gpu to use. -1 = use CPU')
cmd:option('-opencl',0,'use OpenCL (instead of CUDA)')
cmd:option('-min_freq',0,'minimum character frequency; rarer characters are replaced by the UNKNOW token')
cmd:text()
-- parse input params
opt = cmd:parse(arg)
torch.manualSeed(opt.seed)
-- train / val / test split for data, in fractions
local test_frac = math.max(0, 1 - (opt.train_frac + opt.val_frac))
local split_sizes = {opt.train_frac, opt.val_frac, test_frac}
-- initialize cunn/cutorch for training on the GPU and fall back to CPU gracefully
if opt.gpuid >= 0 and opt.opencl == 0 then
local ok, cunn = pcall(require, 'cunn')
local ok2, cutorch = pcall(require, 'cutorch')
if not ok then print('package cunn not found!') end
if not ok2 then print('package cutorch not found!') end
if ok and ok2 then
print('using CUDA on GPU ' .. opt.gpuid .. '...')
cutorch.setDevice(opt.gpuid + 1) -- note +1 to make it 0 indexed! sigh lua
cutorch.manualSeed(opt.seed)
else
print('If cutorch and cunn are installed, your CUDA toolkit may be improperly configured.')
print('Check your CUDA toolkit installation, rebuild cutorch and cunn, and try again.')
print('Falling back on CPU mode')
opt.gpuid = -1 -- overwrite user setting
end
end
-- initialize clnn/cltorch for training on the GPU and fall back to CPU gracefully
if opt.gpuid >= 0 and opt.opencl == 1 then
local ok, clnn = pcall(require, 'clnn')
local ok2, cltorch = pcall(require, 'cltorch')
if not ok then print('package clnn not found!') end
if not ok2 then print('package cltorch not found!') end
if ok and ok2 then
print('using OpenCL on GPU ' .. opt.gpuid .. '...')
cltorch.setDevice(opt.gpuid + 1) -- note +1 to make it 0 indexed! sigh lua
torch.manualSeed(opt.seed)
else
print('If cltorch and clnn are installed, your OpenCL driver may be improperly configured.')
print('Check your OpenCL driver installation, check output of clinfo command, and try again.')
print('Falling back on CPU mode')
opt.gpuid = -1 -- overwrite user setting
end
end
-- create the data loader class
local loader = CharSplitLMMinibatchLoader.create(opt.data_dir, opt.batch_size, opt.seq_length, split_sizes, opt.min_freq)
local vocab_size = loader.vocab_size -- the number of distinct characters
local vocab = loader.vocab_mapping
print('vocab size: ' .. vocab_size)
-- make sure output directory exists
if not path.exists(opt.checkpoint_dir) then lfs.mkdir(opt.checkpoint_dir) end
-- define the model: prototypes for one timestep, then clone them in time
local do_random_init = true
if string.len(opt.init_from) > 0 then
print('loading an LSTM from checkpoint ' .. opt.init_from)
local checkpoint = torch.load(opt.init_from)
protos = checkpoint.protos
-- make sure the vocabs are the same
local vocab_compatible = true
for c,i in pairs(checkpoint.vocab) do
if vocab[c] ~= i then
vocab_compatible = false
end
end
assert(vocab_compatible, 'error, the character vocabulary for this dataset and the one in the saved checkpoint are not the same. This is trouble.')
-- overwrite model settings based on checkpoint to ensure compatibility
print('overwriting rnn_size=' .. checkpoint.opt.rnn_size .. ', num_layers=' .. checkpoint.opt.num_layers .. ' based on the checkpoint.')
opt.rnn_size = checkpoint.opt.rnn_size
opt.num_layers = checkpoint.opt.num_layers
do_random_init = false
else
print('creating an LSTM with ' .. opt.num_layers .. ' layers')
protos = {}
protos.rnn = LSTM.lstm(vocab_size, opt.rnn_size, opt.num_layers, opt.dropout)
protos.criterion = nn.ClassNLLCriterion()
end
-- the initial state of the cell/hidden states
init_state = {}
for L=1,opt.num_layers do
local h_init = torch.zeros(opt.batch_size, opt.rnn_size)
if opt.gpuid >=0 and opt.opencl == 0 then h_init = h_init:cuda() end
if opt.gpuid >=0 and opt.opencl == 1 then h_init = h_init:cl() end
table.insert(init_state, h_init:clone())
table.insert(init_state, h_init:clone())
end
-- ship the model to the GPU if desired
if opt.gpuid >= 0 and opt.opencl == 0 then
for k,v in pairs(protos) do v:cuda() end
end
if opt.gpuid >= 0 and opt.opencl == 1 then
for k,v in pairs(protos) do v:cl() end
end
-- put the above things into one flattened parameters tensor
params, grad_params = model_utils.combine_all_parameters(protos.rnn)
-- initialization
if do_random_init then
params:uniform(-0.08, 0.08) -- small numbers uniform
end
print('number of parameters in the model: ' .. params:nElement())
-- make a bunch of clones after flattening, as that reallocates memory
clones = {}
for name,proto in pairs(protos) do
print('cloning ' .. name)
clones[name] = model_utils.clone_many_times(proto, opt.seq_length, not proto.parameters)
end
-- evaluate the loss over an entire split
function eval_split(split_index, max_batches)
print('evaluating loss over split index ' .. split_index)
local n = loader.split_sizes[split_index]
if max_batches ~= nil then n = math.min(max_batches, n) end
loader:reset_batch_pointer(split_index) -- move batch iteration pointer for this split to front
local loss = 0
local rnn_state = {[0] = init_state}
for i = 1,n do -- iterate over batches in the split
-- fetch a batch
local x, y = loader:next_batch(split_index)
if opt.gpuid >= 0 and opt.opencl == 0 then -- ship the input arrays to GPU
-- have to convert to float because integers can't be cuda()'d
x = x:float():cuda()
y = y:float():cuda()
end
if opt.gpuid >= 0 and opt.opencl == 1 then -- ship the input arrays to GPU
x = x:cl()
y = y:cl()
end
-- forward pass
for t=1,opt.seq_length do
clones.rnn[t]:evaluate() -- for dropout proper functioning
local lst = clones.rnn[t]:forward{x[{{}, t}], unpack(rnn_state[t-1])}
rnn_state[t] = {}
for i=1,#init_state do table.insert(rnn_state[t], lst[i]) end
prediction = lst[#lst]
loss = loss + clones.criterion[t]:forward(prediction, y[{{}, t}])
end
-- carry over lstm state
rnn_state[0] = rnn_state[#rnn_state]
print(i .. '/' .. n .. '...')
end
loss = loss / opt.seq_length / n
return loss
end
-- do fwd/bwd and return loss, grad_params
local init_state_global = clone_list(init_state)
function feval(x)
if x ~= params then
params:copy(x)
end
grad_params:zero()
------------------ get minibatch -------------------
local x, y = loader:next_batch(1)
if opt.gpuid >= 0 and opt.opencl == 0 then -- ship the input arrays to GPU
-- have to convert to float because integers can't be cuda()'d
x = x:float():cuda()
y = y:float():cuda()
end
if opt.gpuid >= 0 and opt.opencl == 1 then -- ship the input arrays to GPU
x = x:cl()
y = y:cl()
end
------------------- forward pass -------------------
local rnn_state = {[0] = init_state_global}
local predictions = {} -- softmax outputs
local loss = 0
for t=1,opt.seq_length do
clones.rnn[t]:training() -- make sure we are in correct mode (this is cheap, sets flag)
local lst = clones.rnn[t]:forward{x[{{}, t}], unpack(rnn_state[t-1])}
rnn_state[t] = {}
for i=1,#init_state do table.insert(rnn_state[t], lst[i]) end -- extract the state, without output
predictions[t] = lst[#lst] -- last element is the prediction
loss = loss + clones.criterion[t]:forward(predictions[t], y[{{}, t}])
end
loss = loss / opt.seq_length
------------------ backward pass -------------------
-- initialize gradient at time t to be zeros (there's no influence from future)
local drnn_state = {[opt.seq_length] = clone_list(init_state, true)} -- true also zeros the clones
for t=opt.seq_length,1,-1 do
-- backprop through loss, and softmax/linear
local doutput_t = clones.criterion[t]:backward(predictions[t], y[{{}, t}])
table.insert(drnn_state[t], doutput_t)
local dlst = clones.rnn[t]:backward({x[{{}, t}], unpack(rnn_state[t-1])}, drnn_state[t])
drnn_state[t-1] = {}
for k,v in pairs(dlst) do
if k > 1 then -- k == 1 is gradient on x, which we dont need
-- note we do k-1 because first item is dembeddings, and then follow the
-- derivatives of the state, starting at index 2. I know...
drnn_state[t-1][k-1] = v
end
end
end
------------------------ misc ----------------------
-- transfer final state to initial state (BPTT)
init_state_global = rnn_state[#rnn_state] -- NOTE: I don't think this needs to be a clone, right?
-- clip gradient element-wise
grad_params:clamp(-opt.grad_clip, opt.grad_clip)
return loss, grad_params
end
-- start optimization here
train_losses = {}
val_losses = {}
local optim_state = {learningRate = opt.learning_rate, alpha = opt.decay_rate}
local iterations = opt.max_epochs * loader.ntrain
local iterations_per_epoch = loader.ntrain
local loss0 = nil
for i = 1, iterations do
local epoch = i / loader.ntrain
local timer = torch.Timer()
local _, loss = optim.rmsprop(feval, params, optim_state)
local time = timer:time().real
local train_loss = loss[1] -- the loss is inside a list, pop it
train_losses[i] = train_loss
-- exponential learning rate decay
if i % loader.ntrain == 0 and opt.learning_rate_decay < 1 then
if epoch >= opt.learning_rate_decay_after then
local decay_factor = opt.learning_rate_decay
optim_state.learningRate = optim_state.learningRate * decay_factor -- decay it
print('decayed learning rate by a factor ' .. decay_factor .. ' to ' .. optim_state.learningRate)
end
end
-- every now and then or on last iteration
if i % opt.eval_val_every == 0 or i == iterations then
-- evaluate loss on validation data
local val_loss = eval_split(2) -- 2 = validation
val_losses[i] = val_loss
local savefile = string.format('%s/lm_%s_epoch%.2f_%.4f.t7', opt.checkpoint_dir, opt.savefile, epoch, val_loss)
print('saving checkpoint to ' .. savefile)
local checkpoint = {}
checkpoint.protos = protos
checkpoint.opt = opt
checkpoint.train_losses = train_losses
checkpoint.val_loss = val_loss
checkpoint.val_losses = val_losses
checkpoint.i = i
checkpoint.epoch = epoch
checkpoint.vocab = loader.vocab_mapping
torch.save(savefile, checkpoint)
end
if i % opt.print_every == 0 then
print(string.format("%d/%d (epoch %.3f), train_loss = %6.8f, grad/param norm = %6.4e, time/batch = %.2fs", i, iterations, epoch, train_loss, grad_params:norm() / params:norm(), time))
end
if i % 10 == 0 then collectgarbage() end
-- handle early stopping if things are going really bad
if loss0 == nil then loss0 = loss[1] end
if loss[1] > loss0 * 3 then
print('loss is exploding, aborting.')
break -- halt
end
end

235
util/CharSplitLMMinibatchLoader.lua Normal file

@@ -0,0 +1,235 @@
-- Modified from https://github.com/oxford-cs-ml-2015/practical6
-- the modification included support for train/val/test splits
local CharSplitLMMinibatchLoader = {}
CharSplitLMMinibatchLoader.__index = CharSplitLMMinibatchLoader
function CharSplitLMMinibatchLoader.create(data_dir, batch_size, seq_length, split_fractions, min_freq)
-- split_fractions is e.g. {0.9, 0.05, 0.05}
local self = {}
setmetatable(self, CharSplitLMMinibatchLoader)
local input_file = path.join(data_dir, 'input.txt')
local vocab_file = path.join(data_dir, 'vocab.t7')
local tensor_file = path.join(data_dir, 'data.t7')
-- fetch file attributes to determine if we need to rerun preprocessing
local run_prepro = false
if not (path.exists(vocab_file) and path.exists(tensor_file)) then
-- prepro files do not exist, generate them
print('vocab.t7 and data.t7 do not exist. Running preprocessing...')
run_prepro = true
else
-- check if the input file was modified since last time we
-- ran the prepro. if so, we have to rerun the preprocessing
local input_attr = lfs.attributes(input_file)
local vocab_attr = lfs.attributes(vocab_file)
local tensor_attr = lfs.attributes(tensor_file)
if input_attr.modification > vocab_attr.modification or input_attr.modification > tensor_attr.modification then
print('vocab.t7 or data.t7 detected as stale. Re-running preprocessing...')
run_prepro = true
end
end
if run_prepro then
-- construct a tensor with all the data, and vocab file
print('one-time setup: preprocessing input text file ' .. input_file .. '...')
CharSplitLMMinibatchLoader.text_to_tensor(input_file, vocab_file, tensor_file, min_freq)
end
print('loading data files...')
local data = torch.load(tensor_file)
self.vocab_mapping = torch.load(vocab_file)
-- cut off the end so that it divides evenly
local len = data:size(1)
if len % (batch_size * seq_length) ~= 0 then
print('cutting off end of data so that the batches/sequences divide evenly')
data = data:sub(1, batch_size * seq_length
* math.floor(len / (batch_size * seq_length)))
end
-- count vocab
self.vocab_size = 0
for _ in pairs(self.vocab_mapping) do
self.vocab_size = self.vocab_size + 1
end
-- self.batches is a table of tensors
print('reshaping tensor...')
self.batch_size = batch_size
self.seq_length = seq_length
local ydata = data:clone()
ydata:sub(1,-2):copy(data:sub(2,-1))
ydata[-1] = data[1]
self.x_batches = data:view(batch_size, -1):split(seq_length, 2) -- #rows = #batches
self.nbatches = #self.x_batches
self.y_batches = ydata:view(batch_size, -1):split(seq_length, 2) -- #rows = #batches
assert(#self.x_batches == #self.y_batches)
-- lets try to be helpful here
if self.nbatches < 50 then
print('WARNING: less than 50 batches in the data in total? Looks like very small dataset. You probably want to use smaller batch_size and/or seq_length.')
end
-- perform safety checks on split_fractions
assert(split_fractions[1] >= 0 and split_fractions[1] <= 1, 'bad split fraction ' .. split_fractions[1] .. ' for train, not between 0 and 1')
assert(split_fractions[2] >= 0 and split_fractions[2] <= 1, 'bad split fraction ' .. split_fractions[2] .. ' for val, not between 0 and 1')
assert(split_fractions[3] >= 0 and split_fractions[3] <= 1, 'bad split fraction ' .. split_fractions[3] .. ' for test, not between 0 and 1')
if split_fractions[3] == 0 then
-- catch a common special case where the user might not want a test set
self.ntrain = math.floor(self.nbatches * split_fractions[1])
self.nval = self.nbatches - self.ntrain
self.ntest = 0
else
-- divide data to train/val and allocate rest to test
self.ntrain = math.floor(self.nbatches * split_fractions[1])
self.nval = math.floor(self.nbatches * split_fractions[2])
self.ntest = self.nbatches - self.nval - self.ntrain -- the rest goes to test (to ensure this adds up exactly)
end
self.split_sizes = {self.ntrain, self.nval, self.ntest}
self.batch_ix = {0,0,0}
print(string.format('data load done. Number of data batches in train: %d, val: %d, test: %d', self.ntrain, self.nval, self.ntest))
collectgarbage()
return self
end
function CharSplitLMMinibatchLoader:reset_batch_pointer(split_index, batch_index)
batch_index = batch_index or 0
self.batch_ix[split_index] = batch_index
end
function CharSplitLMMinibatchLoader:next_batch(split_index)
if self.split_sizes[split_index] == 0 then
-- perform a check here to make sure the user isn't screwing something up
local split_names = {'train', 'val', 'test'}
print('ERROR. Code requested a batch for split ' .. split_names[split_index] .. ', but this split has no data.')
os.exit() -- crash violently
end
-- split_index is integer: 1 = train, 2 = val, 3 = test
self.batch_ix[split_index] = self.batch_ix[split_index] + 1
if self.batch_ix[split_index] > self.split_sizes[split_index] then
self.batch_ix[split_index] = 1 -- cycle around to beginning
end
-- pull out the correct next batch
local ix = self.batch_ix[split_index]
if split_index == 2 then ix = ix + self.ntrain end -- offset by train set size
if split_index == 3 then ix = ix + self.ntrain + self.nval end -- offset by train + val
return self.x_batches[ix], self.y_batches[ix]
end
-- chinese vocab
function get_vocab(str, min_freq)
local len = #str
local left = 0
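-- UTF-8 lead-byte thresholds: bytes below 0xc0 are single ASCII characters,
-- 0xc0-0xdf start 2-byte sequences, 0xe0-0xef 3-byte sequences, and so on;
-- the index i found below is therefore the byte length of the current character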
local arr = {0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc}
local unordered = {}
local start = 1
local wordLen = 0
g_total_chars = 0
while len ~= left do
local tmp = string.byte(str, start)
local i = #arr
while arr[i] do
if tmp >= arr[i] then
break
end
i = i - 1
end
wordLen = i + wordLen
local tmpString = string.sub(str, start, wordLen)
start = start + i
left = left + i
if not unordered[tmpString] then
unordered[tmpString] = 1
else
unordered[tmpString] = unordered[tmpString] + 1
end
g_total_chars = g_total_chars + 1
end
final_res = {}
for char_val, char_cnt in pairs(unordered) do
if char_cnt >= min_freq then
final_res[char_val] = true
end
end
return final_res
end
-- change raw data to tokens
function get_data(str, vocab_mapping)
-- cannot use torch.ByteTensor because it cannot store more than 256 distinct values
-- local data = torch.ByteTensor(g_total_chars) -- store it into 1D first, then rearrange
local data = torch.ShortTensor(g_total_chars)
local len = #str
local left = 0
local arr = {0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc}
local start = 1
local wordLen = 0
local count = 1
while len ~= left do
local tmp = string.byte(str, start)
local i = #arr
while arr[i] do
if tmp >= arr[i] then
break
end
i = i - 1
end
wordLen = i + wordLen
local tmpString = string.sub(str, start, wordLen)
start = start + i
left = left + i
if vocab_mapping[tmpString] then
data[count] = vocab_mapping[tmpString]
else
data[count] = vocab_mapping['UNKNOW']
end
count = count + 1
end
return data
end
-- *** STATIC method ***
function CharSplitLMMinibatchLoader.text_to_tensor(in_textfile, out_vocabfile, out_tensorfile, min_freq)
local timer = torch.Timer()
print('loading text file...')
local f = torch.DiskFile(in_textfile)
local rawdata = f:readString('*a') -- NOTE: this reads the whole file at once
f:close()
-- create vocabulary if it doesn't exist yet
print('creating vocabulary mapping...')
-- record all characters to a set
local unordered = get_vocab(rawdata, min_freq)
-- sort into a table (i.e. keys become 1..N)
local ordered = {}
for char in pairs(unordered) do ordered[#ordered + 1] = char end
table.sort(ordered)
-- invert `ordered` to create the char->int mapping
local vocab_mapping = {}
count_vocab = 0
for i, char in ipairs(ordered) do
vocab_mapping[char] = i
count_vocab = count_vocab + 1
end
vocab_mapping['UNKNOW'] = count_vocab + 1
-- construct a tensor with all the data
print('putting data into tensor, this can take a while...')
local data = get_data(rawdata, vocab_mapping)
-- save output preprocessed files
print('saving ' .. out_vocabfile)
torch.save(out_vocabfile, vocab_mapping)
print('saving ' .. out_tensorfile)
torch.save(out_tensorfile, data)
end
return CharSplitLMMinibatchLoader

20
util/OneHot.lua Normal file

@@ -0,0 +1,20 @@
local OneHot, parent = torch.class('OneHot', 'nn.Module')
function OneHot:__init(outputSize)
parent.__init(self)
self.outputSize = outputSize
-- We'll construct one-hot encodings by using the index method to
-- reshuffle the rows of an identity matrix. To avoid recreating
-- it every iteration we'll cache it.
self._eye = torch.eye(outputSize)
end
function OneHot:updateOutput(input)
self.output:resize(input:size(1), self.outputSize):zero()
if self._eye == nil then self._eye = torch.eye(self.outputSize) end
self._eye = self._eye:float()
local longInput = input:long()
self.output:copy(self._eye:index(1, longInput))
return self.output
end
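-- usage sketch (not part of the original file):
--   local oh = OneHot(5)
--   local out = oh:forward(torch.Tensor{2, 4, 1}) -- 3x5 matrix with a single 1 per row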

13
util/misc.lua Normal file

@@ -0,0 +1,13 @@
-- misc utilities
function clone_list(tensor_list, zero_too)
-- utility function. todo: move away to some utils file?
-- takes a list of tensors and returns a list of cloned tensors
local out = {}
for k,v in pairs(tensor_list) do
out[k] = v:clone()
if zero_too then out[k]:zero() end
end
return out
end

162
util/model_utils.lua Normal file

@@ -0,0 +1,162 @@
-- adapted from https://github.com/wojciechz/learning_to_execute
-- utilities for combining/flattening parameters in a model
-- the code in this script is more general than it needs to be, which is
-- why it is kind of large
require 'torch'
local model_utils = {}
function model_utils.combine_all_parameters(...)
--[[ like module:getParameters, but operates on many modules ]]--
-- get parameters
local networks = {...}
local parameters = {}
local gradParameters = {}
for i = 1, #networks do
local net_params, net_grads = networks[i]:parameters()
if net_params then
for _, p in pairs(net_params) do
parameters[#parameters + 1] = p
end
for _, g in pairs(net_grads) do
gradParameters[#gradParameters + 1] = g
end
end
end
local function storageInSet(set, storage)
local storageAndOffset = set[torch.pointer(storage)]
if storageAndOffset == nil then
return nil
end
local _, offset = unpack(storageAndOffset)
return offset
end
-- this function flattens arbitrary lists of parameters,
-- even complex shared ones
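-- the trick: fill a fresh flat tensor with 1s, point each parameter tensor
-- into it, and zero the slots each one actually occupies; the remaining 1s
-- are 'holes' left by sharing, and their cumulative sum tells how far to
-- shift each parameter when compacting into the final flat tensor below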
local function flatten(parameters)
if not parameters or #parameters == 0 then
return torch.Tensor()
end
local Tensor = parameters[1].new
local storages = {}
local nParameters = 0
for k = 1,#parameters do
local storage = parameters[k]:storage()
if not storageInSet(storages, storage) then
storages[torch.pointer(storage)] = {storage, nParameters}
nParameters = nParameters + storage:size()
end
end
local flatParameters = Tensor(nParameters):fill(1)
local flatStorage = flatParameters:storage()
for k = 1,#parameters do
local storageOffset = storageInSet(storages, parameters[k]:storage())
parameters[k]:set(flatStorage,
storageOffset + parameters[k]:storageOffset(),
parameters[k]:size(),
parameters[k]:stride())
parameters[k]:zero()
end
local maskParameters= flatParameters:float():clone()
local cumSumOfHoles = flatParameters:float():cumsum(1)
local nUsedParameters = nParameters - cumSumOfHoles[#cumSumOfHoles]
local flatUsedParameters = Tensor(nUsedParameters)
local flatUsedStorage = flatUsedParameters:storage()
for k = 1,#parameters do
local offset = cumSumOfHoles[parameters[k]:storageOffset()]
parameters[k]:set(flatUsedStorage,
parameters[k]:storageOffset() - offset,
parameters[k]:size(),
parameters[k]:stride())
end
for _, storageAndOffset in pairs(storages) do
local k, v = unpack(storageAndOffset)
flatParameters[{{v+1,v+k:size()}}]:copy(Tensor():set(k))
end
if cumSumOfHoles:sum() == 0 then
flatUsedParameters:copy(flatParameters)
else
local counter = 0
for k = 1,flatParameters:nElement() do
if maskParameters[k] == 0 then
counter = counter + 1
flatUsedParameters[counter] = flatParameters[counter+cumSumOfHoles[k]]
end
end
assert (counter == nUsedParameters)
end
return flatUsedParameters
end
-- flatten parameters and gradients
local flatParameters = flatten(parameters)
local flatGradParameters = flatten(gradParameters)
-- return new flat vector that contains all discrete parameters
return flatParameters, flatGradParameters
end
function model_utils.clone_many_times(net, T)
local clones = {}
local params, gradParams
if net.parameters then
params, gradParams = net:parameters()
if params == nil then
params = {}
end
end
local paramsNoGrad
if net.parametersNoGrad then
paramsNoGrad = net:parametersNoGrad()
end
local mem = torch.MemoryFile("w"):binary()
mem:writeObject(net)
for t = 1, T do
-- We need to use a new reader for each clone.
-- We don't want to use the pointers to already read objects.
local reader = torch.MemoryFile(mem:storage(), "r"):binary()
local clone = reader:readObject()
reader:close()
if net.parameters then
local cloneParams, cloneGradParams = clone:parameters()
local cloneParamsNoGrad
for i = 1, #params do
cloneParams[i]:set(params[i])
cloneGradParams[i]:set(gradParams[i])
end
if paramsNoGrad then
cloneParamsNoGrad = clone:parametersNoGrad()
for i =1,#paramsNoGrad do
cloneParamsNoGrad[i]:set(paramsNoGrad[i])
end
end
end
clones[t] = clone
collectgarbage()
end
mem:close()
return clones
end
return model_utils