Predicting A-Share Stock Trends with an LSTM Model

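With some free time over the past few days, and seeing so many people online commenting on the ups and downs of the stock market, I decided on a whim to see whether deep learning can predict stock trends. Since my previous research focused on CV and NAS, I also wanted to take this opportunity to get a deeper understanding of the RNN family.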

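A few days later

After several days of observation, an LSTM does not seem able to predict A-share trends. I am still a beginner in this area, though, so the code may well contain mistakes; corrections are welcome!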

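Downloading the dataset

证券宝 (BaoStock, www.baostock.com) is a free, open-source securities data platform that requires no registration. See the official website for a description of its advantages.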

Uncomment the code below to install the baostock library.

# !pip install baostock

Following the official tutorial, download the daily K-line (candlestick) data for one stock. Here we take sz.002648 from 2015 onward.

import baostock as bs
import pandas as pd

#### Log in ####
lg = bs.login()

#### Fetch historical K-line data for Shanghai/Shenzhen A-shares ####
# For the full list of fields, see the "historical quote field parameters" chapter of the
# baostock documentation; minute-level fields differ from daily fields.
# Minute-level fields: date,time,code,open,high,low,close,volume,amount,adjustflag
rs = bs.query_history_k_data_plus("sz.002648",
    "date,code,open,high,low,close,preclose,volume,amount,adjustflag,turn,tradestatus,pctChg,isST",
    start_date='2015-01-01', end_date='2020-04-14',
    frequency="d", adjustflag="3")  # daily bars; adjustflag "3" means no price adjustment

#### Collect the result set ####
data_list = []
while rs.error_code == '0' and rs.next():
    # Fetch one record at a time and append it to the list
    data_list.append(rs.get_row_data())
df = pd.DataFrame(data_list, columns=rs.fields)

#### Write the result set to a CSV file ####
# df.to_csv("D:\\history_A_stock_k_data.csv", index=False)
print(df)

#### Log out ####
bs.logout()

login success!
            date       code     open     high      low    close preclose  \
0     2015-01-05  sz.002648  12.0400  12.4700  11.8000  12.2200  12.1000   
1     2015-01-06  sz.002648  12.2000  12.3600  12.0000  12.2900  12.2200   
2     2015-01-07  sz.002648  12.2900  12.5600  12.2500  12.4000  12.2900   
3     2015-01-08  sz.002648  12.4300  12.6300  12.3000  12.4700  12.4000   
4     2015-01-09  sz.002648  12.4600  12.7400  12.3900  12.4100  12.4700   
...          ...        ...      ...      ...      ...      ...      ...   
1281  2020-04-08  sz.002648  13.7400  14.1200  13.6500  13.9700  13.8400   
1282  2020-04-09  sz.002648  14.0800  14.2900  14.0300  14.1800  13.9700   
1283  2020-04-10  sz.002648  14.0500  14.0500  13.5400  13.7600  14.1800   
1284  2020-04-13  sz.002648  14.1900  15.1400  14.0000  14.9600  13.7600   
1285  2020-04-14  sz.002648  15.0600  15.2000  14.8000  14.8900  14.9600   

        volume          amount adjustflag      turn tradestatus     pctChg  \
0     16718083  201474357.0000          3  2.089760           1   0.991700   
1      7965196   97122442.0000          3  0.995649           1   0.572800   
2      7684398   95370790.0000          3  0.960550           1   0.895000   
3      8056201  100447582.0000          3  1.007025           1   0.564500   
4      7648582   96383568.0000          3  0.956073           1  -0.481200   
...        ...             ...        ...       ...         ...        ...   
1281  15477705  214734917.0800          3  1.490800           1   0.939300   
1282  16011717  226975113.8300          3  1.542300           1   1.503200   
1283  17320147  238855999.3500          3  1.669400           1  -2.961900   
1284  56552045  834440561.6200          3  5.450700           1   8.720900   
1285  42736530  638906444.8700          3  4.119100           1  -0.467900   

     isST  
0       0  
1       0  
2       0  
3       0  
4       0  
...   ...  
1281    0  
1282    0  
1283    0  
1284    0  
1285    0  

[1286 rows x 14 columns]
logout success!

<baostock.data.resultset.ResultData at 0x11cdb33d0>

All downloaded fields are stored as strings, so we convert the ones we need to integers and floats.

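# Columns to parse as floating-point numbers
float_type = ['open','high','low','close','preclose','amount','pctChg']

for item in float_type:
    df[item] = df[item].astype('float')

# amount was parsed as float above; truncate it to an integer
df['amount'] = df['amount'].astype('int')
df['volume'] = df['volume'].astype('int')
# turn can be an empty string; treat such values as 0
df['turn'] = [0 if x == "" else float(x) for x in df["turn"]]
# Placeholder label; rows that never receive a real label keep this value
df['buy_flag'] = 10

# df.tail()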
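As a possible alternative to the explicit casts above, here is a minimal sketch using pd.to_numeric, which turns unparsable strings (such as the empty turn values) into NaN instead of raising. The helper name to_numeric_columns and the toy frame are my own illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame mimicking baostock output, where every field arrives as a string.
raw = pd.DataFrame({
    "close": ["12.2200", "12.2900"],
    "volume": ["16718083", "7965196"],
    "turn": ["2.089760", ""],  # an empty string stands for a missing turnover value
})

def to_numeric_columns(frame, float_cols, int_cols):
    """Return a copy of frame with the given columns converted to numeric dtypes."""
    out = frame.copy()
    for col in float_cols:
        # errors="coerce" maps unparsable strings such as "" to NaN, then NaN becomes 0.0
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0.0).astype("float64")
    for col in int_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0).astype("int64")
    return out

clean = to_numeric_columns(raw, float_cols=["close", "turn"], int_cols=["volume"])
print(clean.dtypes)
```

Filling missing values with 0 mirrors what the original cell does for turn; whether 0 is a sensible stand-in for a missing turnover rate is a modelling choice.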

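Processing the dataset

Predicting the price itself with an LSTM is clearly unreasonable, because price fluctuations are highly unpredictable. So we settle for the next best thing and predict the trend, i.e. whether the stock will go up or down.

How to quantify a rise or fall is a problem in itself. I had never dealt with stocks before, so here I simply assume that the average price over the next few days represents the stock's movement.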

def MA_next(df, date_idx, price_type, n):
    # Mean of price_type over the n rows starting at date_idx (the current day included)
    return df[price_type][date_idx:date_idx+n].mean()
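Calling MA_next once per row is fine for about 1,300 rows, but the same forward-looking average can also be computed for the whole column at once. The sketch below is an assumption-laden illustration (the function names are mine): it uses a rolling mean shifted back by n-1 rows and checks it against a loop version on a toy series.

```python
import numpy as np
import pandas as pd

def forward_mean_loop(prices, n):
    """Reference version: mean of prices[i:i+n] for every i (shorter windows at the tail)."""
    return pd.Series([prices.iloc[i:i + n].mean() for i in range(len(prices))])

def forward_mean_vectorized(prices, n):
    """Same quantity via a rolling mean shifted back by n-1 rows.

    Rows that do not have n values from their own position onward come out as NaN.
    """
    return prices.rolling(n).mean().shift(-(n - 1))

prices = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0, 14.0])
loop = forward_mean_loop(prices, 3)
fast = forward_mean_vectorized(prices, 3)
full = len(prices) - 3 + 1  # number of positions that have a complete 3-day window
print(np.allclose(loop.to_numpy()[:full], fast.to_numpy()[:full]))
```

The two versions differ only in the last n-1 rows, where the loop averages a shorter window and the vectorized version yields NaN.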

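Assume the short term is 2 days, the medium term 6 days and the long term 15 days. If the average price over the next 15 days is higher than the average over the next 6 days, which in turn is higher than the average over the next 2 days, we can consider the outlook for the next 15 days to be good. The code additionally requires the 15-day average to exceed the 6-day average by more than 3%, which reduces frequent flipping of the label to some extent.

'2' means buy, '0' means sell, and '1' is the default value (in the code below the default is 1 plus the relative gap between the 6-day and 2-day averages, so it sits slightly above 1).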

s_time = 2
m_time = 6
l_time = 15

for i in range(len(df)-l_time):
    # Forward-looking average closing prices over the short, medium and long horizons
    ma_s = MA_next(df,i,'close',s_time)
    ma_m = MA_next(df,i,'close',m_time)
    ma_l = MA_next(df,i,'close',l_time)
    if ma_l > ma_m*1.03 > ma_s*1.03:
        df.loc[i, 'buy_flag'] = 2   # buy
    elif ma_s > ma_m:
        df.loc[i, 'buy_flag'] = 0   # sell
    else:
        # default: 1 plus the relative gap between the medium and short averages
        df.loc[i, 'buy_flag'] = 1 + (ma_m-ma_s)/ma_s
#     df.loc[i, 'buy_flag'] = 10*(MA_next(df,i,'close',m_time)+MA_next(df,i,'close',l_time)-2*MA_next(df,i,'close',s_time))/MA_next(df,i,'close',s_time)

df.tail()
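To make the three branches of the labeling rule easier to see, here is the same decision written as a small standalone function and applied to hand-picked averages. The function name make_label and the example numbers are hypothetical; only the rule itself comes from the loop above.

```python
def make_label(ma_short, ma_mid, ma_long, gain=1.03):
    """Label one day from its forward-looking short/medium/long average prices."""
    if ma_long > ma_mid * gain > ma_short * gain:
        return 2.0  # buy: long average beats the medium one by more than 3%
    if ma_short > ma_mid:
        return 0.0  # sell: the near-term average is above the medium-term one
    # default: 1 plus the relative gap between the medium and short averages
    return 1.0 + (ma_mid - ma_short) / ma_short

print(make_label(10.0, 10.2, 10.6))  # rising strongly
print(make_label(10.5, 10.0, 9.0))   # falling
print(make_label(10.0, 10.2, 10.3))  # rising, but by less than 3%
```

Note that the default branch yields a continuous value at or just above 1, so the label column mixes two discrete values with a continuous one; this is also why the model is later trained with a regression loss.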

Visualization

Plot with plotly

import plotly.graph_objects as go
# from IPython.display import HTML
import chart_studio.plotly as py

fig = go.Figure(data=[go.Candlestick(x=df['date'],
                open=df['open'], high=df['high'],
                low=df['low'], close=df['close'],
                increasing_line_color= 'red', decreasing_line_color= 'green')
                     ])

fig.add_trace(go.Scatter(x=df['date'],y=df['buy_flag'], name='Flag'))

fig.update_layout(
    xaxis_range=['2017-01-01','2019-12-31'],
    yaxis_title='Price',
#     xaxis_rangeslider_visible=False,
)

py.iplot(fig, filename="stock-price")

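The Fast.ai Part 1 course mentions a function for expanding date features, add_datepart, which computes the year, month, day, day of week, week number, month start/end flags, day of year and similar information for a date. We use it to expand the date feature.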

from fastai.tabular import *
add_datepart(df, "date", drop=False)
seq_length = 90
train_df = df[seq_length:-seq_length]
# Drop unimportant features
train_df = train_df.drop(['date','code','Is_month_end', 'Is_month_start', 'Is_quarter_end',
                          'Is_quarter_start', 'Is_year_end', 'Is_year_start','Dayofyear'],axis=1)
train_df
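For readers without fastai installed, the date columns that survive the drop above (Year, Month, Week, Day, Dayofweek, Elapsed) can be approximated with plain pandas. This is a rough sketch, not fastai's implementation; the helper name add_date_features is made up, and it assumes a pandas version that provides Series.dt.isocalendar().

```python
import pandas as pd

def add_date_features(frame, column):
    """Return a copy of frame with calendar features derived from a date column."""
    out = frame.copy()
    dates = pd.to_datetime(out[column])
    out["Year"] = dates.dt.year
    out["Month"] = dates.dt.month
    out["Week"] = dates.dt.isocalendar().week.astype("int64")  # ISO week number
    out["Day"] = dates.dt.day
    out["Dayofweek"] = dates.dt.dayofweek  # Monday = 0
    # Seconds elapsed since the Unix epoch (1970-01-01)
    out["Elapsed"] = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    return out

demo = add_date_features(pd.DataFrame({"date": ["2015-05-20", "2015-05-25"]}), "date")
print(demo)
```

For 2015-05-20 and 2015-05-25 this gives the same Week, Day, Dayofweek and Elapsed values that appear in the train_df output for rows 90 and 93.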

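Next we generate sequences from the data: the information from the preceding seq_length days is the input sequence, and the buy_flag of the following day is the label.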

def sliding_windows(data, label, seq_length):
    x = []
    y = []

    for i in range(len(data)-seq_length-1):
        _x = data[i:(i+seq_length)]
        _y = label[i+seq_length]
        x.append(_x)
        y.append(_y)

    return np.array(x),np.array(y)
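A quick way to sanity-check the windowing is to run it on a tiny array where every value is recognisable. The sketch below repeats the same logic under a different name (make_windows, my own choice) so that it is self-contained; the toy data is invented.

```python
import numpy as np

def make_windows(data, label, seq_length):
    """Same windowing as sliding_windows: seq_length rows as input, the next row's label as target."""
    x, y = [], []
    for i in range(len(data) - seq_length - 1):
        x.append(data[i:i + seq_length])
        y.append(label[i + seq_length])
    return np.array(x), np.array(y)

toy_data = np.arange(20).reshape(10, 2)   # 10 days, 2 features
toy_label = np.arange(10) * 10            # label of day k is 10*k
wx, wy = make_windows(toy_data, toy_label, seq_length=3)
print(wx.shape, wy.shape)
print(wy)
```

With 10 rows and seq_length=3 this produces 6 windows, and the label of the last row is never used because of the extra -1 in the range.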

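In PyTorch, nn.LSTM by default expects input of shape (seq_length, batch_size, feature), whereas the sequences we generate have shape (batch_size, seq_length, feature), so the first two dimensions of the input need to be swapped.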
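The axis swap can be checked with NumPy alone. The sketch below uses made-up sizes and confirms that after transpose(1, 0, 2) the element at [t, n] is the one that used to be at [n, t]. Another option, not used in this post, is to construct nn.LSTM with batch_first=True and skip the transpose.

```python
import numpy as np

batch_size, seq_length, n_features = 4, 3, 2

# Batch-first layout, as produced by a sliding-window function: (batch, seq, feature)
batch_first = np.arange(batch_size * seq_length * n_features).reshape(
    batch_size, seq_length, n_features)

# Sequence-first layout, the default expected by nn.LSTM: (seq, batch, feature)
seq_first = batch_first.transpose(1, 0, 2)

print(batch_first.shape, "->", seq_first.shape)
```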

from sklearn.preprocessing import MinMaxScaler
import numpy as np
import torch
from torch import nn

y_scaler = MinMaxScaler()
x_scaler = MinMaxScaler()

#converting dataset into x_train and y_train
X = train_df.drop(['buy_flag'],axis=1).values
X = x_scaler.fit_transform(X)
Y = train_df['buy_flag']
Y = np.array(Y).reshape(-1,1)
Y = y_scaler.fit_transform(Y)

x, y = sliding_windows(X, Y, seq_length)

y_train,y_test = y[:int(y.shape[0]*0.8)],y[int(y.shape[0]*0.8):]
x_train,x_test = x[:int(x.shape[0]*0.8)],x[int(x.shape[0]*0.8):]

# lstm: seq, batch, feature
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

dataX = torch.Tensor(x.transpose(1,0,2))
dataY = torch.Tensor(y)
trainX = torch.Tensor(x_train.transpose(1,0,2))
trainY = torch.Tensor(y_train)
testX = torch.Tensor(x_test.transpose(1,0,2))
testY = torch.Tensor(y_test)
trainX.shape, trainY.shape

(torch.Size([90, 812, 18]), torch.Size([812, 1]))
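One thing worth noting about the cell above: both scalers are fitted on all rows before the 80/20 split, so the minimum and maximum of the test period influence the training features. A possible variation is to fit the scaling on the training rows only and reuse it for the test rows. The sketch below shows the idea with a hand-written min-max scaling in NumPy; the function names and the toy numbers are my own.

```python
import numpy as np

def fit_min_max(train):
    """Compute per-column offset and span from the training rows only."""
    lo = train.min(axis=0)
    hi = train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant columns
    return lo, span

def apply_min_max(values, lo, span):
    """Scale values with statistics that were fitted elsewhere."""
    return (values - lo) / span

features = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
split = 3  # the first 3 rows are "training", the rest "test"
lo, span = fit_min_max(features[:split])
train_scaled = apply_min_max(features[:split], lo, span)
test_scaled = apply_min_max(features[split:], lo, span)
print(train_scaled)
print(test_scaled)
```

With this approach test values can fall outside [0, 1], which is expected: the model has simply never seen that range during training.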

Building the LSTM model

class LSTM(nn.Module):

    def __init__(self, num_classes, input_size, hidden_size, num_layers):
        super(LSTM, self).__init__()

        self.num_classes = num_classes
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size

        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                            num_layers=num_layers)

        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (seq_length, batch_size, input_size)
        # If h_0 and c_0 are not passed explicitly, they default to zeros
        ula, (h_out, _) = self.lstm(x)

        # h_out: (num_layers, batch_size, hidden_size); keep the last layer's final hidden state
        h_out = h_out[-1]

        out = self.fc(h_out)

        return out

Training the model

num_epochs = 15
learning_rate = 0.001

input_size = train_df.shape[1]-1 # The number of expected features in the input x
hidden_size = 300 # The number of features in the hidden state h
num_layers = 1 # Number of recurrent layers.

num_classes = 1 # output

lstm = LSTM(num_classes, input_size, hidden_size, num_layers)

criterion = torch.nn.MSELoss()    # mean-squared error for regression
optimizer = torch.optim.Adam(lstm.parameters(), lr=learning_rate)
#optimizer = torch.optim.SGD(lstm.parameters(), lr=learning_rate)

# Train the model
lstm.train()
lstm.to(device)
trainX = trainX.to(device)
trainY = trainY.to(device)  # the targets must be on the same device as the outputs
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = lstm(trainX)

    # compute the loss
    loss = criterion(outputs, trainY)

    loss.backward()

    optimizer.step()
    print("Epoch: %d, loss: %1.5f" % (epoch, loss.item()))

Epoch: 0, loss: 0.22898
Epoch: 1, loss: 0.17325
Epoch: 2, loss: 0.14031
Epoch: 3, loss: 0.13380
Epoch: 4, loss: 0.14764
Epoch: 5, loss: 0.14571
Epoch: 6, loss: 0.13712
Epoch: 7, loss: 0.13153
Epoch: 8, loss: 0.12988
Epoch: 9, loss: 0.13052
Epoch: 10, loss: 0.13184
Epoch: 11, loss: 0.13284
Epoch: 12, loss: 0.13308
Epoch: 13, loss: 0.13251
Epoch: 14, loss: 0.13131

Checking the training results

import plotly.graph_objects as go

lstm.eval()
lstm.to(torch.device('cpu'))
with torch.no_grad():
    dataY_pred = lstm(dataX)

dataY_pred = dataY_pred.data.numpy()
dataY_truth = dataY.data.numpy()

dataY_pred = y_scaler.inverse_transform(dataY_pred)
dataY_truth = y_scaler.inverse_transform(dataY_truth)


fig = go.Figure(go.Scatter(y=dataY_truth.flatten(),name='Ground Truth'))
fig.add_trace(go.Scatter(y=dataY_pred.flatten(),name='Predicted'))

fig.update_layout(
    shapes = [dict(
        x0=len(x_train), x1=len(x_train), y0=0, y1=1, xref='x', yref='paper',
        line_width=2)], # vertical line separating the training set from the test set
    xaxis_rangeslider_visible=True,
)
py.iplot(fig, filename="stock-result")

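The predictions turn out to sit close to the mean of the labels. In other words, the model does not adjust its output according to the input; it simply outputs roughly the label average, which has no practical value.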

import random
# randrange excludes the upper bound, so i is always a valid sample index
i = random.randrange(testX.shape[1])
with torch.no_grad():
    y_pred = lstm(testX[:,i,:].reshape(testX.shape[0],1,testX.shape[2]))
print('Predicted: {0}, actual: {1}'.format(y_pred.data.numpy(),testY[i].reshape(-1,1)))

Predicted: [[0.288591]], actual: tensor([[1.]])
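One way to put a number on the observation that the model behaves like a constant predictor is to compare its test MSE with the MSE of always predicting the mean training label. The sketch below only shows the baseline computation on invented numbers; the function names are hypothetical, and plugging in the real y_train and y_test arrays is left as an assumption.

```python
import numpy as np

def mse(pred, truth):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((np.asarray(pred, dtype=float) - np.asarray(truth, dtype=float)) ** 2))

def mean_baseline_mse(y_train, y_test):
    """MSE obtained on y_test by always predicting the mean of y_train."""
    constant = float(np.mean(y_train))
    y_test = np.asarray(y_test, dtype=float)
    return mse(np.full(y_test.shape, constant), y_test)

y_train_demo = np.array([0.0, 0.5, 1.0, 0.5])
y_test_demo = np.array([0.0, 1.0])
baseline = mean_baseline_mse(y_train_demo, y_test_demo)
print(baseline)
```

If the trained LSTM does not beat this baseline on the test split, it has learned little beyond the label average.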


Output of train_df:

       open   high    low  close  preclose    volume     amount  adjustflag      turn  \
90    16.15  16.85  16.07  16.28     16.19  20654461  340943328           3  2.581808
91    16.45  16.84  16.30  16.81     16.28  21049050  349775744           3  2.631131
92    16.91  17.50  16.72  17.25     16.81  26445834  455184800           3  3.305729
93    17.20  17.72  16.93  17.52     17.25  25543119  444374592           3  3.192890
94    17.78  17.82  17.26  17.68     17.52  23197897  407829184           3  2.899737
...     ...    ...    ...    ...       ...       ...        ...         ...       ...
1191  14.22  14.35  13.77  13.86     14.10  13857446  193912775           3  1.334800
1192  13.92  14.62  13.91  14.44     13.86  33019563  474715627           3  3.180500
1193  14.47  14.56  14.23  14.23     14.44  14896110  213625282           3  1.434800
1194  14.18  14.54  14.01  14.29     14.23  16550049  236702541           3  1.594100
1195  14.28  14.76  14.21  14.44     14.29  16324504  236742107           3  1.572400

      tradestatus   pctChg  isST  buy_flag  Year  Month  Week  Day  Dayofweek     Elapsed
90              1   0.5559     0  2.000000  2015      5    21   20          2  1432080000
91              1   3.2555     0  2.000000  2015      5    21   21          3  1432166400
92              1   2.6175     0  2.000000  2015      5    21   22          4  1432252800
93              1   1.5652     0  1.070076  2015      5    22   25          0  1432512000
94              1   0.9132     0  1.031780  2015      5    22   26          1  1432598400
...           ...      ...   ...       ...   ...    ...   ...  ...        ...         ...
1191            1  -1.7021     0  1.014370  2019     11    47   22          4  1574380800
1192            1   4.1847     0  1.012324  2019     11    48   25          0  1574640000
1193            1  -1.4543     0  1.028050  2019     11    48   26          1  1574726400
1194            1   0.4216     0  1.031558  2019     11    48   27          2  1574812800
1195            1   1.0497     0  1.021047  2019     11    48   28          3  1574899200

Output of df.tail():

            date       code   open   high    low  close  preclose    volume  \
1281  2020-04-08  sz.002648  13.74  14.12  13.65  13.97     13.84  15477705
1282  2020-04-09  sz.002648  14.08  14.29  14.03  14.18     13.97  16011717
1283  2020-04-10  sz.002648  14.05  14.05  13.54  13.76     14.18  17320147
1284  2020-04-13  sz.002648  14.19  15.14  14.00  14.96     13.76  56552045
1285  2020-04-14  sz.002648  15.06  15.20  14.80  14.89     14.96  42736530

         amount  adjustflag    turn  tradestatus   pctChg  buy_flag
1281  214734917           3  1.4908            1   0.9393      10.0
1282  226975113           3  1.5423            1   1.5032      10.0
1283  238855999           3  1.6694            1  -2.9619      10.0
1284  834440561           3  5.4507            1   8.7209      10.0
1285  638906444           3  4.1191            1  -0.4679      10.0