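Predicting A-Share Stock Trends with an LSTM Model

I had some free time over the past few days and saw many people online commenting on the ups and downs of the stock market, so on a whim I decided to see whether deep learning can predict stock trends. My previous research was all in CV and NAS, so I also wanted to use this opportunity to get a deeper understanding of the RNN family.

A few days later


After several days of observation, it seems that an LSTM cannot predict A-share trends. That said, I am still a beginner in this area, so the code may well contain mistakes; corrections are welcome!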

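Downloading the dataset

Baostock (证券宝, www.baostock.com) is a free, open-source securities data platform that requires no registration. See the official site for details on its advantages.

Install the baostock library first, for example with pip install baostock.

Following the official tutorial, download the K-line (candlestick) data for one stock. Here we take sz.002648 from 2015 onward.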

import baostock as bs
import pandas as pd

# Log in to the system
lg = bs.login()

# Fetch historical K-line data for Shanghai/Shenzhen A-shares
# For the full list of fields, see the "historical market data fields" section of the baostock docs;
# minute-level fields differ from daily fields.
# Minute-level fields: date,time,code,open,high,low,close,volume,amount,adjustflag
rs = bs.query_history_k_data_plus("sz.002648",
    "date,code,open,high,low,close,preclose,volume,amount,adjustflag,turn,tradestatus,pctChg,isST",
    start_date='2015-01-01', end_date='2020-04-14',
    frequency="d", adjustflag="3")

# Collect the result set
data_list = []
while (rs.error_code == '0') and rs.next():
    # Fetch one record and append it to the list
    data_list.append(rs.get_row_data())
df = pd.DataFrame(data_list, columns=rs.fields)

# Optionally write the result set to a CSV file
# df.to_csv("D:\\history_A_stock_k_data.csv", index=False)
print(df)

# Log out of the system
bs.logout()
login success!
            date       code     open     high      low    close preclose  \
0     2015-01-05  sz.002648  12.0400  12.4700  11.8000  12.2200  12.1000   
1     2015-01-06  sz.002648  12.2000  12.3600  12.0000  12.2900  12.2200   
2     2015-01-07  sz.002648  12.2900  12.5600  12.2500  12.4000  12.2900   
3     2015-01-08  sz.002648  12.4300  12.6300  12.3000  12.4700  12.4000   
4     2015-01-09  sz.002648  12.4600  12.7400  12.3900  12.4100  12.4700   
...          ...        ...      ...      ...      ...      ...      ...   
1281  2020-04-08  sz.002648  13.7400  14.1200  13.6500  13.9700  13.8400   
1282  2020-04-09  sz.002648  14.0800  14.2900  14.0300  14.1800  13.9700   
1283  2020-04-10  sz.002648  14.0500  14.0500  13.5400  13.7600  14.1800   
1284  2020-04-13  sz.002648  14.1900  15.1400  14.0000  14.9600  13.7600   
1285  2020-04-14  sz.002648  15.0600  15.2000  14.8000  14.8900  14.9600   

        volume          amount adjustflag      turn tradestatus     pctChg  \
0     16718083  201474357.0000          3  2.089760           1   0.991700   
1      7965196   97122442.0000          3  0.995649           1   0.572800   
2      7684398   95370790.0000          3  0.960550           1   0.895000   
3      8056201  100447582.0000          3  1.007025           1   0.564500   
4      7648582   96383568.0000          3  0.956073           1  -0.481200   
...        ...             ...        ...       ...         ...        ...   
1281  15477705  214734917.0800          3  1.490800           1   0.939300   
1282  16011717  226975113.8300          3  1.542300           1   1.503200   
1283  17320147  238855999.3500          3  1.669400           1  -2.961900   
1284  56552045  834440561.6200          3  5.450700           1   8.720900   
1285  42736530  638906444.8700          3  4.119100           1  -0.467900   

     isST  
0       0  
1       0  
2       0  
3       0  
4       0  
...   ...  
1281    0  
1282    0  
1283    0  
1284    0  
1285    0  

[1286 rows x 14 columns]
logout success!
<baostock.data.resultset.ResultData at 0x11cdb33d0>

All downloaded values are stored as strings, so we convert the columns we need to integers and floats.

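float_type = ['open','high','low','close','preclose','amount','pctChg']

for item in float_type:
    df[item] = df[item].astype('float')

# amount arrives as a string like '201474357.0000', so it goes through float before int
df['amount'] = df['amount'].astype('int')
df['volume'] = df['volume'].astype('int')
# turn can be an empty string; treat that as 0
df['turn'] = [0 if x == "" else float(x) for x in df["turn"]]
# Placeholder label; rows that are never labelled below keep this value
df['buy_flag'] = 10.0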

# df.tail()
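
The empty-string handling for turn can also be expressed with pandas' own converter. The following is a minimal sketch on made-up values, assuming pandas is installed; the toy frame is hypothetical and is not real baostock output.

```python
import pandas as pd

# Hypothetical toy frame mimicking the string-typed columns described above.
toy = pd.DataFrame({
    "close": ["12.22", "12.29", "12.40"],
    "turn": ["2.089760", "", "0.960550"],  # an empty string stands for a missing value
})

# errors="coerce" turns anything unparsable (including "") into NaN,
# which is then replaced by 0.0 to match the list comprehension above.
toy["close"] = pd.to_numeric(toy["close"], errors="coerce")
toy["turn"] = pd.to_numeric(toy["turn"], errors="coerce").fillna(0.0)

print(toy.dtypes)
print(toy)
```

Whether a missing turnover rate should really become 0 is a modelling choice; the sketch only reproduces the behavior of the code above.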

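Processing the dataset

Predicting the price itself with an LSTM is clearly unreasonable, because price movements are highly erratic. So we settle for the next best thing and predict the trend, i.e. whether the stock goes up or down.

How to quantify a rise or fall is a problem in itself. I had never dealt with stocks before, so I simply assumed that the average price over the next few days can represent the movement.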

def MA_next(df, date_idx, price_type, n): 
    return df[price_type][date_idx:date_idx+n].mean()
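
A small check of what MA_next actually averages may help here. The sketch below repeats the function on a made-up five-row frame (the toy data is an assumption, not real prices). It shows that the window starts at date_idx itself, so the current day is included, and that the window is silently shortened near the end of the frame.

```python
import pandas as pd

def MA_next(df, date_idx, price_type, n):
    # Same definition as above: mean over rows date_idx .. date_idx+n-1.
    return df[price_type][date_idx:date_idx+n].mean()

# Made-up closing prices, one row per trading day.
toy = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0, 14.0]})

print(MA_next(toy, 0, "close", 2))  # mean of rows 0 and 1
print(MA_next(toy, 3, "close", 5))  # only rows 3 and 4 exist, so the window is shorter
```

The labelling loop further down stops l_time rows before the end, so the truncated case does not occur there.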

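Take 2 days as short term, 6 days as medium term, and 15 days as long term. If the 15-day forward average is greater than the 6-day forward average, which in turn is greater than the 2-day forward average, we consider the outlook for the next 15 days to be good. The code additionally requires the 15-day average to exceed the 6-day average by more than 3%, which reduces, to some extent, how often the label flips.

A label of 2 means buy and 0 means sell. Every other row gets a neutral label of 1 plus the relative gap between the 6-day and 2-day averages, so neutral values sit at or slightly above 1. The last 15 rows are never labelled and keep the placeholder value 10.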

s_time = 2
m_time = 6
l_time = 15

for i in range(len(df)-l_time):
    # Forward-looking average closes, each window starting at day i
    ma_s = MA_next(df, i, 'close', s_time)
    ma_m = MA_next(df, i, 'close', m_time)
    ma_l = MA_next(df, i, 'close', l_time)
    if ma_l > ma_m*1.03 > ma_s*1.03:
        df.loc[i, 'buy_flag'] = 2
    elif ma_s > ma_m:
        df.loc[i, 'buy_flag'] = 0
    else:
        # Neutral: 1 plus the relative gap between the mid- and short-term averages
        df.loc[i, 'buy_flag'] = 1 + (ma_m - ma_s)/ma_s
    # Alternative label that was tried:
    # df.loc[i, 'buy_flag'] = 10*(ma_m + ma_l - 2*ma_s)/ma_s
        
df.tail()
date code open high low close preclose volume amount adjustflag turn tradestatus pctChg isST buy_flag
1281 2020-04-08 sz.002648 13.74 14.12 13.65 13.97 13.84 15477705 214734917 3 1.4908 1 0.9393 0 10.0
1282 2020-04-09 sz.002648 14.08 14.29 14.03 14.18 13.97 16011717 226975113 3 1.5423 1 1.5032 0 10.0
1283 2020-04-10 sz.002648 14.05 14.05 13.54 13.76 14.18 17320147 238855999 3 1.6694 1 -2.9619 0 10.0
1284 2020-04-13 sz.002648 14.19 15.14 14.00 14.96 13.76 56552045 834440561 3 5.4507 1 8.7209 0 10.0
1285 2020-04-14 sz.002648 15.06 15.20 14.80 14.89 14.96 42736530 638906444 3 4.1191 1 -0.4679 0 10.0
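
To make the three branches of the labelling rule easier to check by hand, here is a minimal sketch that applies the same comparisons to three given averages. The function name label_from_averages and the margin parameter are hypothetical names introduced only for this illustration, and the input numbers are made up.

```python
def label_from_averages(ma_s, ma_m, ma_l, margin=1.03):
    """Mirror of the if/elif/else rule above, given the three forward averages."""
    if ma_l > ma_m * margin > ma_s * margin:
        return 2.0                          # buy: long > 1.03 * mid, and mid > short
    elif ma_s > ma_m:
        return 0.0                          # sell: short-term average is above mid-term
    else:
        return 1.0 + (ma_m - ma_s) / ma_s   # neutral: 1 plus the relative gap, never below 1

print(label_from_averages(10.0, 10.5, 11.0))  # rising by more than 3% from mid to long: buy
print(label_from_averages(10.5, 10.0, 9.5))   # falling: sell
print(label_from_averages(10.0, 10.2, 10.3))  # rising, but less than 3%: neutral
```

Note that the 3% margin effectively applies only to the long-versus-mid comparison, because the factor cancels in the mid-versus-short one.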

Visualization

Plot with plotly

import plotly.graph_objects as go
# from IPython.display import HTML
import chart_studio.plotly as py

fig = go.Figure(data=[go.Candlestick(x=df['date'],
                open=df['open'], high=df['high'],
                low=df['low'], close=df['close'],
                increasing_line_color= 'red', decreasing_line_color= 'green')
                     ])

fig.add_trace(go.Scatter(x=df['date'],y=df['buy_flag'], name='Flag'))

fig.update_layout(
    xaxis_range=['2017-01-01','2019-12-31'],
    yaxis_title='Price',
#     xaxis_rangeslider_visible=False,
)

py.iplot(fig, filename="stock-price")

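The Fast.ai Part 1 course mentions a function, add_datepart, that expands a date column into extra features: year, month, day, day of week, week number, whether the date is a month start or end, day of year, and so on. We use it to expand the date feature here.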

from fastai.tabular import *
add_datepart(df, "date", drop=False)
seq_length = 90
train_df = df[seq_length:-seq_length]
# Drop features that are not important
train_df = train_df.drop(['date','code','Is_month_end', 'Is_month_start', 'Is_quarter_end',
                          'Is_quarter_start', 'Is_year_end', 'Is_year_start','Dayofyear'],axis=1)
train_df
open high low close preclose volume amount adjustflag turn tradestatus pctChg isST buy_flag Year Month Week Day Dayofweek Elapsed
90 16.15 16.85 16.07 16.28 16.19 20654461 340943328 3 2.581808 1 0.5559 0 2.000000 2015 5 21 20 2 1432080000
91 16.45 16.84 16.30 16.81 16.28 21049050 349775744 3 2.631131 1 3.2555 0 2.000000 2015 5 21 21 3 1432166400
92 16.91 17.50 16.72 17.25 16.81 26445834 455184800 3 3.305729 1 2.6175 0 2.000000 2015 5 21 22 4 1432252800
93 17.20 17.72 16.93 17.52 17.25 25543119 444374592 3 3.192890 1 1.5652 0 1.070076 2015 5 22 25 0 1432512000
94 17.78 17.82 17.26 17.68 17.52 23197897 407829184 3 2.899737 1 0.9132 0 1.031780 2015 5 22 26 1 1432598400
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1191 14.22 14.35 13.77 13.86 14.10 13857446 193912775 3 1.334800 1 -1.7021 0 1.014370 2019 11 47 22 4 1574380800
1192 13.92 14.62 13.91 14.44 13.86 33019563 474715627 3 3.180500 1 4.1847 0 1.012324 2019 11 48 25 0 1574640000
1193 14.47 14.56 14.23 14.23 14.44 14896110 213625282 3 1.434800 1 -1.4543 0 1.028050 2019 11 48 26 1 1574726400
1194 14.18 14.54 14.01 14.29 14.23 16550049 236702541 3 1.594100 1 0.4216 0 1.031558 2019 11 48 27 2 1574812800
1195 14.28 14.76 14.21 14.44 14.29 16324504 236742107 3 1.572400 1 1.0497 0 1.021047 2019 11 48 28 3 1574899200

1106 rows × 19 columns
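
For readers without fastai installed, the date columns kept in the table above can be approximated with the standard library alone. This is only a sketch of the idea, not fastai's implementation; date_parts is a hypothetical helper, and it assumes the Week column is the ISO week number and Elapsed is seconds since the Unix epoch at midnight UTC.

```python
import calendar
from datetime import date

def date_parts(d):
    """Hypothetical helper: a few of the fields that add_datepart derives from a date."""
    _, iso_week, _ = d.isocalendar()
    return {
        "Year": d.year,
        "Month": d.month,
        "Week": iso_week,                            # ISO 8601 week number (assumed)
        "Day": d.day,
        "Dayofweek": d.weekday(),                    # Monday = 0
        "Elapsed": calendar.timegm(d.timetuple()),   # seconds since 1970-01-01 UTC
    }

print(date_parts(date(2015, 5, 20)))
```

For 2015-05-20 the values agree with the first row of the table above (Year 2015, Month 5, Week 21, Day 20, Dayofweek 2, Elapsed 1432080000).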

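Next we generate sequences from the data: the features of the previous seq_length days form the input sequence, and the buy_flag of the following day is the label.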

def sliding_windows(data, label, seq_length):
    x = []
    y = []

    for i in range(len(data)-seq_length-1):
        _x = data[i:(i+seq_length)]
        _y = label[i+seq_length]
        x.append(_x)
        y.append(_y)

    return np.array(x),np.array(y)
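
A toy run makes the resulting shapes concrete. The sketch below repeats the same windowing logic on made-up data (10 time steps, 2 features, with the step index as the label), so all numbers are illustrative only.

```python
import numpy as np

def sliding_windows(data, label, seq_length):
    # Same logic as above, repeated so the sketch is self-contained.
    x, y = [], []
    for i in range(len(data) - seq_length - 1):
        x.append(data[i:i + seq_length])
        y.append(label[i + seq_length])
    return np.array(x), np.array(y)

# Toy data: 10 time steps with 2 features; the label is just the step index.
toy_data = np.arange(20, dtype=float).reshape(10, 2)
toy_label = np.arange(10, dtype=float).reshape(-1, 1)

x, y = sliding_windows(toy_data, toy_label, seq_length=3)
print(x.shape, y.shape)   # (windows, seq_length, features) and (windows, 1)
print(y.ravel())          # each label belongs to the step right after its window
```

With 10 steps and a window of 3 this yields 6 windows whose labels are steps 3 to 8. Because of the extra -1 in the range, the last possible window (with label 9) is not generated; this costs one sample but is otherwise harmless.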

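In PyTorch, an LSTM expects input of shape (seq_length, batch_size, feature) by default, whereas the sequences we generate have shape (batch_size, seq_length, feature), so the first two dimensions of the input need to be swapped.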
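
The axis swap can be illustrated with NumPy alone. In this sketch the sizes and the variable names batch_first and seq_first are made up for the example; it only checks that transposing the first two axes moves each sample's time step to the expected position.

```python
import numpy as np

batch_size, seq_length, n_features = 4, 3, 2
# Windows as produced by a sliding-window function: (batch, seq, feature).
batch_first = np.arange(batch_size * seq_length * n_features).reshape(
    batch_size, seq_length, n_features)

# Swap the first two axes to get (seq, batch, feature).
seq_first = batch_first.transpose(1, 0, 2)

print(batch_first.shape, "->", seq_first.shape)
# Sample b at time step t is the same feature vector in both layouts.
print(np.array_equal(seq_first[1, 2], batch_first[2, 1]))
```

As an alternative, nn.LSTM accepts batch_first=True, which lets the input stay in (batch, seq, feature) order; that option is not used in this post.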

from sklearn.preprocessing import MinMaxScaler
import numpy as np
import torch  # imported explicitly rather than relying on the fastai star import

y_scaler = MinMaxScaler()
x_scaler = MinMaxScaler()

#converting dataset into x_train and y_train
X = train_df.drop(['buy_flag'],axis=1).values
X = x_scaler.fit_transform(X)
Y = train_df['buy_flag']
Y = np.array(Y).reshape(-1,1)
Y = y_scaler.fit_transform(Y)

x, y = sliding_windows(X, Y, seq_length)

y_train,y_test = y[:int(y.shape[0]*0.8)],y[int(y.shape[0]*0.8):]
x_train,x_test = x[:int(x.shape[0]*0.8)],x[int(x.shape[0]*0.8):]

# lstm: seq, batch, feature
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

dataX = torch.Tensor(x.transpose(1,0,2))
dataY = torch.Tensor(y)
trainX = torch.Tensor(x_train.transpose(1,0,2))
trainY = torch.Tensor(y_train)
testX = torch.Tensor(x_test.transpose(1,0,2))
testY = torch.Tensor(y_test)
trainX.shape, trainY.shape
(torch.Size([90, 812, 18]), torch.Size([812, 1]))
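
One detail of the preprocessing above is that both scalers are fitted on all rows, including those that later end up in the test split. A common variation is to compute the scaling parameters on the training part only. The sketch below shows that variation with plain NumPy on made-up numbers; it is not what the code above does, and the helper names scale and unscale are hypothetical.

```python
import numpy as np

values = np.array([[1.0], [2.0], [4.0], [3.0], [6.0]])  # made-up series, oldest first
split = int(len(values) * 0.8)                           # same 80/20 chronological split

# Fit the scaling parameters on the training part only.
lo = values[:split].min(axis=0)
hi = values[:split].max(axis=0)

def scale(v):
    return (v - lo) / (hi - lo)      # min-max scaling relative to the training range

def unscale(v):
    return v * (hi - lo) + lo        # inverse transform

train_scaled = scale(values[:split])
test_scaled = scale(values[split:])  # may fall outside [0, 1]; that is expected
print(train_scaled.ravel(), test_scaled.ravel())
```

With scikit-learn the same effect would come from calling fit on the training rows and transform on the rest, at the cost of test values that can leave the [0, 1] range.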

Building the LSTM model

import torch.nn as nn

class LSTM(nn.Module):

    def __init__(self, num_classes, input_size, hidden_size, num_layers):
        super(LSTM, self).__init__()

        self.num_classes = num_classes
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size

        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                            num_layers=num_layers)

        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x has shape (seq_length, batch_size, input_size).
        # When h_0 and c_0 are not passed explicitly, they default to zeros.
        _, (h_out, _) = self.lstm(x)

        # h_out has shape (num_layers, batch_size, hidden_size);
        # keep only the final hidden state of the last layer.
        h_out = h_out[-1]

        out = self.fc(h_out)

        return out

Training the model

num_epochs = 15
learning_rate = 0.001

input_size = train_df.shape[1]-1 # The number of expected features in the input x
hidden_size = 300 # The number of features in the hidden state h
num_layers = 1 # Number of recurrent layers.

num_classes = 1 # output

lstm = LSTM(num_classes, input_size, hidden_size, num_layers)

criterion = torch.nn.MSELoss()    # mean-squared error for regression
optimizer = torch.optim.Adam(lstm.parameters(), lr=learning_rate)
#optimizer = torch.optim.SGD(lstm.parameters(), lr=learning_rate)

# Train the model
lstm.train()
lstm.to(device)
trainX = trainX.to(device)
trainY = trainY.to(device)  # the targets must be on the same device as the outputs
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = lstm(trainX)
    
    # obtain the loss function
    loss = criterion(outputs, trainY)
    
    loss.backward()
    
    optimizer.step()
    print("Epoch: %d, loss: %1.5f" % (epoch, loss.item()))
Epoch: 0, loss: 0.22898
Epoch: 1, loss: 0.17325
Epoch: 2, loss: 0.14031
Epoch: 3, loss: 0.13380
Epoch: 4, loss: 0.14764
Epoch: 5, loss: 0.14571
Epoch: 6, loss: 0.13712
Epoch: 7, loss: 0.13153
Epoch: 8, loss: 0.12988
Epoch: 9, loss: 0.13052
Epoch: 10, loss: 0.13184
Epoch: 11, loss: 0.13284
Epoch: 12, loss: 0.13308
Epoch: 13, loss: 0.13251
Epoch: 14, loss: 0.13131

Inspecting the training results

import plotly.graph_objects as go

lstm.eval()
lstm.to(torch.device('cpu'))
with torch.no_grad():
    dataY_pred = lstm(dataX)

dataY_pred = dataY_pred.data.numpy()
dataY_truth = dataY.data.numpy()

dataY_pred = y_scaler.inverse_transform(dataY_pred)
dataY_truth = y_scaler.inverse_transform(dataY_truth)


fig = go.Figure(go.Scatter(y=dataY_truth.flatten(),name='Ground Truth'))
fig.add_trace(go.Scatter(y=dataY_pred.flatten(),name='Predicted'))

fig.update_layout(
    shapes = [dict(
        x0=len(x_train), x1=len(x_train), y0=0, y1=1, xref='x', yref='paper',
        line_width=2)],  # vertical line separating the training and test sets
    xaxis_rangeslider_visible=True,
)
py.iplot(fig, filename="stock-result")

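The predictions turn out to sit close to the mean of the labels. In other words, the model does not adjust its output to the input; it simply outputs roughly the label mean, which has no practical value.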
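
One way to make this observation quantitative is to compare the model's MSE with that of a constant predictor that always outputs the mean training label. The sketch below computes that baseline on made-up scaled labels; the arrays y_train_toy and y_test_toy and the helper mse are stand-ins introduced for illustration, not values from this experiment.

```python
import numpy as np

def mse(pred, truth):
    # Mean squared error between two arrays of the same shape.
    return float(np.mean((pred - truth) ** 2))

# Made-up scaled labels standing in for y_train / y_test.
y_train_toy = np.array([0.0, 0.5, 0.5, 1.0, 0.5])
y_test_toy = np.array([0.5, 1.0, 0.0, 0.5])

baseline = y_train_toy.mean()                                   # constant prediction
baseline_mse = mse(np.full_like(y_test_toy, baseline), y_test_toy)
print(baseline, baseline_mse)
```

If the LSTM's test MSE is no lower than this baseline on the real y_train and y_test, the model has learned essentially nothing from the inputs, which is consistent with the plot.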

import random
i = random.randrange(testX.shape[1])  # randint's upper bound is inclusive and could index out of range
with torch.no_grad():
    y_pred = lstm(testX[:, i, :].reshape(testX.shape[0], 1, testX.shape[2]))
print('Predicted: {0}, actual: {1}'.format(y_pred.numpy(), testY[i].reshape(-1,1)))
Predicted: [[0.288591]], actual: tensor([[1.]])