{"id":984,"date":"2018-01-12T06:49:00","date_gmt":"2018-01-12T06:49:00","guid":{"rendered":"https:\/\/thehive.ai\/blog\/?p=984"},"modified":"2024-07-05T07:30:00","modified_gmt":"2024-07-05T07:30:00","slug":"simple-ml-serving","status":"publish","type":"post","link":"https:\/\/thehive.ai\/blog\/simple-ml-serving","title":{"rendered":"Step-by-step Guide to Deploying Deep Learning Models"},"content":{"rendered":"\n<h2>Or, I just trained a machine learning model &#8211; now what?<\/h2>\n\n\n\n<p>This post goes over a quick and dirty way to deploy a trained machine learning model to production.<\/p>\n\n\n\n<h2>ML in production<\/h2>\n\n\n\n<p>When we first entered the machine learning space here at Hive, we already had millions of ground truth labeled images, allowing us to train a state-of-the-art deep convolutional image classification model from scratch (i.e. randomized weights) in under a week, specialized for our use case. The more typical ML use case, though, is on the order of hundreds of images, for which I would recommend fine-tuning an existing model. 
For instance, https:\/\/www.tensorflow.org\/tutorials\/image_retraining has a great tutorial on how to fine-tune an ImageNet model (trained on 1.2M images, 1000 classes) to classify a sample dataset of flowers (3647 images, 5 classes).<\/p>\n\n\n\n<p>For a quick tl;dr of the linked TensorFlow tutorial, after installing Bazel and TensorFlow, you would need to run the following code, which takes around 30 minutes to build and 5 minutes to train:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>(\ncd \"$HOME\" &amp;&amp; \\\ncurl -O http:\/\/download.tensorflow.org\/example_images\/flower_photos.tgz &amp;&amp; \\\ntar xzf flower_photos.tgz ;\n) &amp;&amp; \\\nbazel build tensorflow\/examples\/image_retraining:retrain \\\n          tensorflow\/examples\/image_retraining:label_image \\\n&amp;&amp; \\\nbazel-bin\/tensorflow\/examples\/image_retraining\/retrain \\\n  --image_dir \"$HOME\"\/flower_photos \\\n  --how_many_training_steps=200 \\\n&amp;&amp; \\\nbazel-bin\/tensorflow\/examples\/image_retraining\/label_image \\\n  --graph=\/tmp\/output_graph.pb \\\n  --labels=\/tmp\/output_labels.txt \\\n  --output_layer=final_result:0 \\\n  --image=$HOME\/flower_photos\/daisy\/21652746_cc379e0eea_m.jpg<\/code><\/pre>\n\n\n\n<p>Alternatively, if you have <a href=\"https:\/\/www.docker.com\/products\/docker-desktop\/\" target=\"_blank\" rel=\"noreferrer noopener\">Docker<\/a> installed, you can use this <a href=\"https:\/\/hub.docker.com\/r\/liubowei\/simple-ml-serving\/\" target=\"_blank\" rel=\"noreferrer noopener\">prebuilt Docker image<\/a> like so:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo docker run -it --net=host liubowei\/simple-ml-serving:latest \/bin\/bash\n\n&gt;&gt;&gt; cat test.sh &amp;&amp; bash test.sh<\/code><\/pre>\n\n\n\n<p>which puts you into an interactive shell inside the container and runs the above command; you can also follow along with the rest of this post inside the container if you wish.<\/p>\n\n\n\n<p>Now, TensorFlow has saved the model information 
into <strong>\/tmp\/output_graph.pb and \/tmp\/output_labels.txt<\/strong>, which are passed above as command-line parameters to the <a href=\"https:\/\/github.com\/tensorflow\/tensorflow\/blob\/r1.4\/tensorflow\/examples\/image_retraining\/label_image.py\" target=\"_blank\" rel=\"noreferrer noopener\">label_image.py<\/a> script. Google&#8217;s image_recognition tutorial also links to <a href=\"https:\/\/github.com\/tensorflow\/models\/blob\/master\/tutorials\/image\/imagenet\/classify_image.py#L130\" target=\"_blank\" rel=\"noreferrer noopener\">another inference script<\/a>, but we will be sticking with label_image.py for now.<\/p>\n\n\n\n<h2>Converting one-shot inference to online inference (TensorFlow)<\/h2>\n\n\n\n<p>If we just want to accept file names from standard input, one per line, we can do &#8220;online&#8221; inference quite easily:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>while read line ; do\nbazel-bin\/tensorflow\/examples\/image_retraining\/label_image \\\n--graph=\/tmp\/output_graph.pb --labels=\/tmp\/output_labels.txt \\\n--output_layer=final_result:0 \\\n--image=\"$line\" ;\ndone<\/code><\/pre>\n\n\n\n<p>From a performance standpoint, though, this is terrible &#8211; we are reloading the neural net, the weights, the entire TensorFlow framework, and Python itself, for every input example!<\/p>\n\n\n\n<p>We can do better. 
Let&#8217;s start by editing the label_image.py script &#8212; for me, this is located in <strong>bazel-bin\/tensorflow\/examples\/image_retraining\/label_image.runfiles\/org_tensorflow\/tensorflow\/examples\/image_retraining\/label_image.py<\/strong>.<\/p>\n\n\n\n<p><strong>Let&#8217;s change the lines<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>141:  run_graph(image_data, labels, FLAGS.input_layer, FLAGS.output_layer,\n142:        FLAGS.num_top_predictions)<\/code><\/pre>\n\n\n\n<p>TO<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>141:  for line in sys.stdin:\n142:    run_graph(load_image(line), labels, FLAGS.input_layer,\n143:        FLAGS.output_layer, FLAGS.num_top_predictions)<\/code><\/pre>\n\n\n\n<p>This is indeed a lot faster, but it&#8217;s still not the best we can do!<\/p>\n\n\n\n<p>The reason is the <strong>with tf.Session() as sess<\/strong> construction on line 100. TensorFlow is essentially loading all the computation into memory every time <strong>run_graph<\/strong> is called. This becomes apparent once you start trying to do inference on the GPU &#8212; you can see the GPU memory go up and down as TensorFlow loads and unloads the model parameters to and from the GPU. 
As far as I know, this construction is not present in other ML frameworks like Caffe or PyTorch.<\/p>\n\n\n\n<p>The solution is then to pull the <strong>with<\/strong> statement out, and pass in a <strong>sess<\/strong> variable to <strong>run_graph<\/strong>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def run_graph(image_data, labels, input_layer_name, output_layer_name,\n              num_top_predictions, sess):\n    # Feed the image_data as input to the graph.\n    #   predictions will contain a two-dimensional array, where one\n    #   dimension represents the input image count, and the other has\n    #   predictions per class\n    softmax_tensor = sess.graph.get_tensor_by_name(output_layer_name)\n    predictions, = sess.run(softmax_tensor, {input_layer_name: image_data})\n    # Sort to show labels in order of confidence\n    top_k = predictions.argsort()&#91;-num_top_predictions:]&#91;::-1]\n    for node_id in top_k:\n      human_string = labels&#91;node_id]\n      score = predictions&#91;node_id]\n      print('%s (score = %.5f)' % (human_string, score))\n    return &#91; (labels&#91;node_id], predictions&#91;node_id].item()) for node_id in top_k ] # numpy floats are not JSON serializable, so call .item()\n\n...\n\n  with tf.Session() as sess:\n    for line in sys.stdin:\n      run_graph(load_image(line), labels, FLAGS.input_layer, FLAGS.output_layer,\n          FLAGS.num_top_predictions, sess)<\/code><\/pre>\n\n\n\n<p>(see code at <a href=\"https:\/\/github.com\/hiveml\/simple-ml-serving\/blob\/master\/label_image.py\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/hiveml\/simple-ml-serving\/blob\/master\/label_image.py<\/a>)<\/p>\n\n\n\n<p>If you run this, you should find that it takes around 0.1 seconds per image, fast enough for online use.<\/p>\n\n\n\n<h2>Converting one-shot inference to online inference (Other ML Frameworks)<\/h2>\n\n\n\n<p>Caffe uses its <strong>net.forward<\/strong> method, which is very easy to put into a 
callable framework: see <a href=\"http:\/\/nbviewer.jupyter.org\/github\/BVLC\/caffe\/blob\/master\/examples\/00-classification.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">http:\/\/nbviewer.jupyter.org\/github\/BVLC\/caffe\/blob\/master\/examples\/00-classification.ipynb<\/a><\/p>\n\n\n\n<p>MXNet is unique in that it actually has ready-to-go inference server code publicly available: <a href=\"https:\/\/github.com\/awslabs\/mxnet-model-server\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/awslabs\/mxnet-model-server<\/a>.<\/p>\n\n\n\n<p>Further details coming soon!<\/p>\n\n\n\n<h2>Deployment<\/h2>\n\n\n\n<p>The plan is to wrap this code in a Flask app. If you haven&#8217;t heard of it, Flask is a very lightweight Python web framework which allows you to spin up an HTTP API server with minimal work.<\/p>\n\n\n\n<p>As a quick reference, here&#8217;s a Flask app that receives POST requests with multipart form data:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/env python\n# usage: python echo.py to launch the server; and then in another session, do\n# curl -v -XPOST 127.0.0.1:12480 -F \"data=@.\/image.jpg\"\nfrom flask import Flask, request\napp = Flask(__name__)\n@app.route('\/', methods=&#91;'POST'])\ndef classify():\n    try:\n        data = request.files.get('data').read()\n        print(repr(data)&#91;:1000])\n        return data, 200\n    except Exception as e:\n        return repr(e), 500\napp.run(host='127.0.0.1',port=12480)<\/code><\/pre>\n\n\n\n<p>And here is the corresponding Flask app hooked up to <strong>run_graph<\/strong> above:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/usr\/bin\/env python\n# usage: bash tf_classify_server.sh\nfrom flask import Flask, request\nimport tensorflow as tf\nimport label_image as tf_classify\nimport json\napp = Flask(__name__)\nFLAGS, unparsed = tf_classify.parser.parse_known_args()\nlabels = tf_classify.load_labels(FLAGS.labels)\ntf_classify.load_graph(FLAGS.graph)\nsess = tf.Session()\n@app.route('\/', methods=&#91;'POST'])\ndef classify():\n    try:\n        data = request.files.get('data').read()\n        result = tf_classify.run_graph(data, labels, FLAGS.input_layer, FLAGS.output_layer, FLAGS.num_top_predictions, sess)\n        return json.dumps(result), 200\n    except Exception as e:\n        return repr(e), 500\napp.run(host='127.0.0.1',port=12480)<\/code><\/pre>\n\n\n\n<p>This looks quite good, except for the fact that Flask and TensorFlow are both fully synchronous &#8211; Flask processes one request at a time in the order they are received, and TensorFlow fully occupies the thread when doing the image classification.<\/p>\n\n\n\n<p>As it&#8217;s written, the speed bottleneck is probably still in the actual computation work, so there&#8217;s not much point upgrading the Flask wrapper code. And maybe this code is sufficient to handle your load, for now.<\/p>\n\n\n\n<p>There are two obvious ways to scale up request throughput: scale up horizontally by increasing the number of workers, which is covered in the next section, or scale up vertically by utilizing a GPU and batching logic. Implementing the latter requires a webserver that is able to handle multiple pending requests at once, and decide whether to keep waiting for a larger batch or send it off to the TensorFlow graph thread to be classified, for which this Flask app is horrendously unsuited. Two possibilities are using Twisted + Klein for keeping code in Python, or Node.js + ZeroMQ if you prefer first-class event loop support and the ability to hook into non-Python AI frameworks such as Torch.<\/p>\n\n\n\n<p>OK, so now we have a single server serving our model, but maybe it&#8217;s too slow or our load is getting too high. 
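<\/p>\n\n\n\n<p>Before moving on to scaling out, here is a rough sketch of the batching logic described above. This is a hypothetical illustration, not production code: <strong>fake_model<\/strong>, <strong>MAX_BATCH<\/strong>, and <strong>MAX_WAIT<\/strong> are made-up names standing in for the real batched <strong>sess.run<\/strong> call and its tuning knobs. A single worker thread drains a queue of pending requests, waiting briefly for stragglers so that it can run them through the model as one batch:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import queue\nimport threading\n\nMAX_BATCH = 4    # largest batch sent to the model at once\nMAX_WAIT = 0.05  # seconds to wait for more requests to arrive\n\ndef fake_model(batch):\n    # stand-in for a single batched sess.run call\n    return &#91;x * 2 for x in batch]\n\nrequest_queue = queue.Queue()\n\ndef batching_worker():\n    while True:\n        batch = &#91;request_queue.get()]  # block for the first request\n        while len(batch) &lt; MAX_BATCH:\n            try:  # wait briefly for stragglers to fill out the batch\n                batch.append(request_queue.get(timeout=MAX_WAIT))\n            except queue.Empty:\n                break\n        outputs = fake_model(&#91;x for x, _ in batch])\n        for (_, reply), out in zip(batch, outputs):\n            reply.put(out)  # hand each caller its own result\n\nthreading.Thread(target=batching_worker, daemon=True).start()\n\ndef classify(x):\n    # called from each request handler; blocks until its batch runs\n    reply = queue.Queue(maxsize=1)\n    request_queue.put((x, reply))\n    return reply.get()<\/code><\/pre>\n\n\n\n<p>Each call to <strong>classify<\/strong> would be made from its own request thread, so concurrent requests naturally coalesce into batches of up to <strong>MAX_BATCH<\/strong>.<\/p>\n\n\n\n<p>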
We&#8217;d like to spin up more of these servers &#8211; how can we distribute requests across each of them?<\/p>\n\n\n\n<p>The ordinary method is to add a proxy layer, perhaps HAProxy or nginx, which balances the load between the backend servers while presenting a single uniform interface to the client. For use later in this section, here is some sample code that runs a rudimentary Node.js load-balancing HTTP proxy:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ Usage : node basic_proxy.js WORKER_PORT_0,WORKER_PORT_1,...\nconst worker_ports = process.argv&#91;2].split(',')\nif (worker_ports.length === 0) { console.error('missing worker ports') ; process.exit(1) }\n\nconst proxy = require('http-proxy').createProxyServer({})\nproxy.on('error', () =&gt; console.log('proxy error'))\n\nlet i = 0\nrequire('http').createServer((req, res) =&gt; {\n  proxy.web(req,res, {target: 'http:\/\/localhost:' + worker_ports&#91; (i++) % worker_ports.length ]})\n}).listen(12480)\nconsole.log(`Proxying localhost:${12480} to &#91;${worker_ports.toString()}]`)\n\n\/\/ spin up the AI workers\nconst { exec } = require('child_process')\nworker_ports.map(port =&gt; exec(`\/bin\/bash .\/tf_classify_server.sh ${port}`))<\/code><\/pre>\n\n\n\n<p>To automatically detect how many backend servers are up and where they are located, people generally use a &#8220;service discovery&#8221; tool, which may be bundled with the load balancer or be separate. Some well-known ones are Consul and ZooKeeper. 
Setting up and learning how to use one is beyond the scope of this article, so I&#8217;ve included a very rudimentary proxy using the node.js service discovery package seaport.<\/p>\n\n\n\n<p><strong>Proxy code:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ Usage : node seaport_proxy.js\nconst seaportServer = require('seaport').createServer()\nseaportServer.listen(12481)\nconst proxy = require('http-proxy').createProxyServer({})\nproxy.on('error', () =&gt; console.log('proxy error'))\n\nlet i = 0\nrequire('http').createServer((req, res) =&gt; {\n  seaportServer.get('tf_classify_server', worker_ports =&gt; {\n    const this_port = worker_ports&#91; (i++) % worker_ports.length ].port\n    proxy.web(req,res, {target: 'http:\/\/localhost:' + this_port })\n  })\n}).listen(12480)\nconsole.log(`Seaport proxy listening on ${12480} to '${'tf_classify_server'}' servers registered to ${12481}`)<\/code><\/pre>\n\n\n\n<p><strong>Worker code:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ Usage : node tf_classify_server.js\nconst port = require('seaport').connect(12481).register('tf_classify_server')\nconsole.log(`Launching tf classify worker on ${port}`)\nrequire('child_process').exec(`\/bin\/bash .\/tf_classify_server.sh ${port}`)<\/code><\/pre>\n\n\n\n<p>However, as applied to AI, this setup runs into a bandwidth problem.<\/p>\n\n\n\n<p>At anywhere from tens to hundreds of images a second, the system becomes bottlenecked on network bandwidth. In the current setup, all the data has to go through our single seaport master, which is the single endpoint presented to the client.<\/p>\n\n\n\n<p>To solve this, we need our clients to not hit the single endpoint at <strong>http:\/\/127.0.0.1:12480<\/strong>, but instead to automatically rotate between backend servers to hit. If you know some networking, this sounds precisely like a job for DNS!<\/p>\n\n\n\n<p>However, setting up a custom DNS server is again beyond the scope of this article. 
Instead, by changing the clients to follow a 2-step &#8220;manual DNS&#8221; protocol, we can reuse our rudimentary seaport proxy to implement a &#8220;peer-to-peer&#8221; protocol in which clients connect directly to their servers:<\/p>\n\n\n\n<p><strong>Proxy code:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ Usage : node p2p_proxy.js\nconst seaportServer = require('seaport').createServer()\nseaportServer.listen(12481)\n\nlet i = 0\nrequire('http').createServer((req, res) =&gt; {\n  seaportServer.get('tf_classify_server', worker_ports =&gt; {\n    const this_port = worker_ports&#91; (i++) % worker_ports.length ].port\n    res.end(`${this_port}\n`)\n  })\n}).listen(12480)\nconsole.log(`P2P seaport proxy listening on ${12480} to 'tf_classify_server' servers registered to ${12481}`)<\/code><\/pre>\n\n\n\n<p>(The worker code is the same as above.)<\/p>\n\n\n\n<p><strong>Client code:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>curl -v -XPOST localhost:`curl localhost:12480` -F \"data=@$HOME\/flower_photos\/daisy\/21652746_cc379e0eea_m.jpg\"<\/code><\/pre>\n\n\n\n<h2>RPC Deployment<\/h2>\n\n\n\n<p>Coming soon! A version of the above with Flask replaced by ZeroMQ.<\/p>\n\n\n\n<h2>Conclusion and further reading<\/h2>\n\n\n\n<p>At this point you should have something working in production, but it&#8217;s certainly not future-proof. 
There are several important topics that were not covered in this guide:<\/p>\n\n\n\n<ul><li>Automatically deploying and setting up on new hardware.<ul><li>Notable tools include OpenStack\/VMware if you&#8217;re on your own hardware, Chef\/Puppet for installing Docker and handling network routes, and Docker for installing TensorFlow, Python, and everything else.<\/li><li>Kubernetes or Marathon\/Mesos are also great if you&#8217;re in the cloud.<\/li><\/ul><\/li><li>Model version management<ul><li>Not too hard to handle this manually at first.<\/li><li>TensorFlow Serving is a great tool that handles this, as well as batching and overall deployment, very thoroughly. The downsides are that it&#8217;s a bit hard to set up and to write client code for, and in addition doesn&#8217;t support Caffe\/PyTorch.<\/li><\/ul><\/li><li>How to migrate your AI code off MATLAB<ul><li>Don&#8217;t do MATLAB in production.<\/li><\/ul><\/li><li>GPU drivers, CUDA, cuDNN<ul><li>Use nvidia-docker and try to find some Dockerfiles online.<\/li><\/ul><\/li><li>Postprocessing layers. 
Once you get a few different AI models in production, you might start wanting to mix and match them for different use cases &#8212; run model A only if model B is inconclusive, run model C in Caffe and pass the results to model D in TensorFlow, etc.<\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This post goes over a step-by-step guide to quickly deploy a trained machine learning model to production.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"kia_subtitle":""},"categories":[8],"tags":[],"_links":{"self":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts\/984"}],"collection":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/comments?post=984"}],"version-history":[{"count":3,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts\/984\/revisions"}],"predecessor-version":[{"id":1057,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/posts\/984\/revisions\/1057"}],"wp:attachment":[{"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/media?parent=984"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/categories?post=984"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/thehive.ai\/blog\/wp-json\/wp\/v2\/tags?post=984"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}