Publications

DistStream: An Order-Aware Distributed Framework for Online-Offline Stream Clustering Algorithms

Lijie Xu, Xingtong Ye, Kai Kang, Tian Guo, Wensheng Dou, Wei Wang and Jun Wei
40th IEEE International Conference on Distributed Computing Systems (ICDCS'20)

Stream clustering is an important data mining technique to capture the evolving patterns in real-time data streams. Today’s data streams, e.g., IoT events and Web clicks, are usually high-speed and contain dynamically-changing patterns. Existing stream clustering algorithms usually follow an online-offline paradigm with a one-record-at-a-time update model, which was designed for running on a single machine. These stream clustering algorithms, with this sequential update model, cannot be efficiently parallelized and fail to deliver the required high throughput for stream clustering.
In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with several efficient parallelization approaches. To maintain high clustering quality, DistStream’s mini-batch update model preserves the update order in all the steps during parallel execution, which can reflect the recent changes in dynamically-changing streaming data. We implement DistStream atop Spark Streaming, as well as four representative stream clustering algorithms based on DistStream. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and comparable (99%) clustering quality with their single-machine counterparts.
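The core idea of the mini-batch update model lends itself to a short illustration. The sketch below is not DistStream's implementation (which runs atop Spark Streaming); it is a minimal, self-contained Python example, under simplified assumptions, of computing partial micro-cluster updates in parallel over contiguous chunks of a time-ordered mini-batch and then merging them back in stream order so that the merged state reflects the most recent changes.

```python
# Minimal sketch (not DistStream itself): order-aware mini-batch updates for an
# online-offline clustering algorithm. Each worker computes partial micro-cluster
# statistics for a contiguous, time-ordered chunk; partial results are merged in
# chunk order so recent records dominate, mimicking the sequential update order.
from concurrent.futures import ThreadPoolExecutor

def partial_update(chunk, centers):
    """Online phase on one chunk: (count, sum, latest timestamp) per nearest center."""
    partial = {}
    for ts, x in chunk:
        c = min(range(len(centers)), key=lambda i: abs(centers[i] - x))
        n, s, last = partial.get(c, (0, 0.0, ts))
        partial[c] = (n + 1, s + x, max(last, ts))
    return partial

def mini_batch_update(centers, mini_batch, workers=4):
    size = max(1, len(mini_batch) // workers)
    chunks = [mini_batch[i:i + size] for i in range(0, len(mini_batch), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda ch: partial_update(ch, centers), chunks))
    state = {i: (0, 0.0, 0.0) for i in range(len(centers))}
    for partial in partials:                      # merge in stream (chunk) order
        for c, (n, s, ts) in partial.items():
            n0, s0, ts0 = state[c]
            state[c] = (n0 + n, s0 + s, max(ts0, ts))
    # The offline phase would periodically recluster these micro-cluster summaries.
    return [s / n if n else centers[c] for c, (n, s, _) in sorted(state.items())]

# Toy 1-D stream of (timestamp, value) records and two initial centers.
stream = [(t, float(v)) for t, v in enumerate([1, 2, 1, 9, 10, 8, 2, 9])]
print(mini_batch_update([1.0, 9.0], stream))      # e.g., [1.5, 9.0]
```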

Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers

Shijian Li, Robert Walls and Tian Guo
40th IEEE International Conference on Distributed Computing Systems (ICDCS'20)

Cloud GPU servers have become the de facto way for deep learning practitioners to train complex models on large-scale datasets. However, it is challenging to determine the appropriate cluster configuration—e.g., server type and number—for different training workloads while balancing the trade-offs in training time, cost, and model accuracy. Adding to the complexity is the potential to reduce the monetary cost by using cheaper, but revocable, transient GPU servers.
In this work, we analyze distributed training performance under diverse cluster configurations using CM-DARE, a cloud-based measurement and training framework. Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers. We also demonstrate the feasibility of predicting training speed and overhead using regression-based models. Finally, we discuss potential use cases of our performance modeling such as detecting and mitigating performance bottlenecks.

QuRate: Power-Efficient Mobile Immersive Video Streaming

Nan Jiang, Yao Liu, Tian Guo, Wenyao Xu, Viswanathan Swaminathan, Lisong Xu, and Sheng Wei
ACM Multimedia Systems Conference 2020 (MMSys'20)

Commodity smartphones have recently become a popular platform for deploying the computation-intensive virtual reality (VR) applications. Among the variety of VR applications, immersive video streaming (a.k.a., 360-degree video streaming) is one of the first commercial use cases deployed at scale. One specific challenge involving the smartphone-based head mounted display (HMD) is to reduce the potentially huge power consumption caused by the immersive video and minimize its mismatch with the constrained battery capacity. To address this challenge, we first conduct an empirical power measurement study on a typical smartphone immersive streaming system, which identifies the major power consumption sources. Then, based on the insights drawn from the measurement study, we propose and develop QuRate, a quality-aware and user-centric frame rate adaptation mechanism to tackle the power consumption issue in immersive video streaming on smartphones. QuRate optimizes the immersive video power consumption by modeling the correlation between the perceivable video quality and the user behavior. Specifically, QuRate builds on top of the user's reduced level of concentration on the video frames during view switching and dynamically adjusts the frame rate without impacting the perceivable video quality. We evaluate QuRate with an Institutional Review Board (IRB)-approved subjective user study to validate its minimum video quality impact. Also, we conduct a comprehensive set of power evaluations involving 5 smartphones, 21 users, and 6 immersive videos with empirical user head movement traces from a publicly available dataset. Our experimental results demonstrate significant power savings and, in particular, QuRate is capable of extending the smartphone battery life by up to 1.24X while maintaining the perceivable video quality during immersive video streaming.
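To make the mechanism concrete, here is a deliberately simplified Python sketch of the underlying idea; the thresholds and rates below are illustrative assumptions, not QuRate's actual parameters or algorithm.

```python
# Illustrative sketch (not QuRate's algorithm): lower the frame rate while the
# user's head is moving quickly, when perceivable quality is least sensitive to
# frame rate, and restore the full rate once the viewport stabilizes.
def select_frame_rate(angular_speed_deg_s, full_fps=60):
    """Map head-movement speed (degrees/second) to a target frame rate."""
    if angular_speed_deg_s >= 120:   # rapid view switching: low user concentration
        return full_fps // 3
    if angular_speed_deg_s >= 30:    # moderate head movement
        return full_fps // 2
    return full_fps                  # stable viewport: keep full quality

# Frame rates chosen along a synthetic head-movement trace (degrees/second).
trace = [5, 20, 45, 150, 90, 10]
print([select_frame_rate(s) for s in trace])   # [60, 60, 30, 20, 30, 60]
```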

DRAB-LOCUS: An Area-Efficient AES Architecture for Hardware Accelerator Co-Location on FPGAs

Jacob T. Grycel and Robert J. Walls
IEEE International Symposium on Circuits and Systems (ISCAS'20)

Advanced Encryption Standard (AES) implementations on Field Programmable Gate Arrays (FPGA) commonly focus on maximizing throughput at the cost of utilizing high volumes of FPGA slice logic. High resource usage limits systems' abilities to implement other functions (such as video processing or machine learning) that may want to share the same FPGA resources. In this paper, we address the shared resource challenge by proposing and evaluating a low-area, but high-throughput, AES architecture. In contrast to existing work, our DSP/RAM-Based Low-CLB Usage (DRAB-LOCUS) architecture leverages block RAM tiles and Digital Signal Processing (DSP) slices to implement the AES SubBytes, MixColumns, and AddRoundKey sub-round transformations, reducing resource usage by a factor of 3 over traditional approaches. To achieve area-efficiency, we built an inner-pipelined architecture using the internal registers of block RAM tiles and DSP slices. Our DRAB-LOCUS architecture features a 12-stage pipeline capable of producing 7.055 Gbps of interleaved encrypted or decrypted data, and only uses 909 look-up tables, 593 flip-flops, 16 block RAMs, and 18 DSP slices in the target device.

Perseus: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models

Matthew LeMay, Shijian Li, Tian Guo
2020 IEEE International Conference on Cloud Engineering (IC2E'20)

Deep learning models are increasingly used for end-user applications, supporting both novel features, such as facial recognition, and traditional features, such as web search. To accommodate high inference throughput, it is common to host a single pre-trained Convolutional Neural Network (CNN) in dedicated cloud-based servers with hardware accelerators such as Graphics Processing Units (GPUs). However, GPUs can be orders of magnitude more expensive than traditional Central Processing Unit (CPU) servers. Under-utilized server resources brought about by dynamic workloads can influence provisioning decisions, which may result in inflated serving costs. One potential way to alleviate this problem is by allowing hosted models to share the underlying resources, which we refer to as multi-tenant inference serving. One of the key challenges is maximizing the resource efficiency for multi-tenant serving given hardware with diverse characteristics, models with unique response time Service Level Agreement (SLA), and dynamic inference workloads. In this paper, we present Perseus, a measurement framework that provides the basis for understanding the performance and cost trade-offs of multi-tenant model serving. We implemented Perseus in Python atop a popular cloud inference server called Nvidia TensorRT Inference Server. Leveraging Perseus, we evaluated the inference throughput and cost for various serving deployments and demonstrated that multi-tenant model serving can lead to up to 12% cost reduction.

MDInference: Balancing Inference Accuracy and Latency for Mobile Applications

Samuel S. Ogden and Tian Guo
2020 IEEE International Conference on Cloud Engineering (IC2E'20), Invited paper

Deep Neural Networks (DNNs) are allowing mobile devices to incorporate a wide range of features into user applications. However, the computational complexity of these models makes it difficult to run them efficiently on resource-constrained mobile devices. Prior work has started to address aspects of supporting deep learning in mobile applications either by decreasing execution latency or resorting to powerful cloud servers. As prior approaches only focus on single aspects of mobile inference, they often fall short in delivering the desired performance.
In this work we introduce a holistic approach to designing mobile deep inference frameworks. We first identify the key goals of accuracy and latency for mobile deep inference, and the conditions that must be met to achieve them. We demonstrate our holistic approach through the design of a hypothetical framework called MDInference. This framework leverages two complementary techniques: a model selection algorithm that chooses from a set of cloud-based deep learning models and an on-device request duplication mechanism.
Through empirically-driven simulations, we show that MDInference achieves an increase in accuracy without impacting its ability to satisfy Service Level Agreements (SLAs). Specifically, we show that MDInference improves aggregate accuracy over static approaches by 40% without incurring SLA violations. Additionally, we show that with SLA = 250ms, MDInference can increase the aggregate accuracy in 99.74% of cases on faster university networks and 96.84% of cases on residential networks.

Poster: PointAR: Efficient Lighting Estimation for Mobile Augmented Reality

Yiqin Zhao and Tian Guo
The 21st International Workshop on Mobile Computing Systems and Applications (HotMobile'20)

In this poster, we describe the problem of lighting estimation in the context of mobile augmented reality (AR) applications and our proposed solution. Lighting estimation refers to recovering scene lighting with limited scene observation and is critical for realistic 3D rendering. As a long-standing challenge in the fields of both computer vision and computer graphics, the difficulty of lighting estimation is exacerbated for mobile AR scenarios. When interacting with mobile AR applications, users would trigger the placement of virtual 3D objects into any position or orientation in their surrounding environments. In order to present a more realistic effect, such objects need to be rendered with appropriate lighting information. However, lighting, especially in the indoor scenes, can vary both spatially and temporally.

Challenges and Opportunities of DNN Model Execution Caching

Guin R. Gilman, Samuel S. Ogden, Robert J. Walls, Tian Guo
Workshop on Distributed Infrastructures for Deep Learning

We explore the opportunities and challenges of model execution caching, a nascent research area that promises to improve the performance of cloud-based deep inference serving. Broadly, model execution caching relies on servers that are geographically close to the end-device to service inference requests, resembling a traditional content delivery network (CDN). However, unlike a CDN, such schemes cache execution rather than static objects. We identify the key challenges inherent to this problem domain and describe the similarities and differences with existing caching techniques. We further introduce several emergent concepts unique to this domain, such as memory-adaptive models and multi-model hosting, which allow us to make dynamic adjustments to the memory requirements of model execution.

Perseus: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models

Matthew LeMay, Shijian Li, Tian Guo
arXiv:1912.02322

Deep learning models are increasingly used for end-user applications, supporting both novel features, such as facial recognition, and traditional features, such as web search. To accommodate high inference throughput, it is common to host a single pre-trained Convolutional Neural Network (CNN) in dedicated cloud-based servers with hardware accelerators such as Graphics Processing Units (GPUs). However, GPUs can be orders of magnitude more expensive than traditional Central Processing Unit (CPU) servers. Under-utilized server resources brought about by dynamic workloads can influence provisioning decisions, which may result in inflated serving costs. One potential way to alleviate this problem is by allowing hosted models to share the underlying resources, which we refer to as multi-tenant inference serving. One of the key challenges is maximizing the resource efficiency for multi-tenant serving given hardware with diverse characteristics, models with unique response time Service Level Agreement (SLA), and dynamic inference workloads. In this paper, we present Perseus, a measurement framework that provides the basis for understanding the performance and cost trade-offs of multi-tenant model serving. We implemented Perseus in Python atop a popular cloud inference server called Nvidia TensorRT Inference Server. Leveraging Perseus, we evaluated the inference throughput and cost for various serving deployments and demonstrated that multi-tenant model serving can lead to up to 12% cost reduction.

ERHARD-RNG: A Random Number Generator Built from Repurposed Hardware in Embedded Systems

Jacob T. Grycel and Robert J. Walls
arXiv:1903.09365

Quality randomness is fundamental to cryptographic operations but on embedded systems good sources are (seemingly) hard to find. Rather than use expensive custom hardware, our ERHARD-RNG Pseudo-Random Number Generator (PRNG) utilizes entropy sources that are already common in a range of low-cost embedded platforms. We empirically evaluate the entropy provided by three sources---SRAM startup state, oscillator jitter, and device temperature---and integrate those sources into a full Pseudo-Random Number Generator implementation based on Fortuna. Our system addresses a number of fundamental challenges affecting random number generation on embedded systems. For instance, we propose SRAM startup state as a means to efficiently generate the initial seed---even for systems that do not have writeable storage. Further, the system's use of oscillator jitter allows for the continuous collection of entropy-generating events---even for systems that do not have the user-generated events that are commonly used in general-purpose systems for entropy, e.g., key presses or network events.
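The Fortuna-style accumulation the abstract refers to can be sketched in a few lines. The Python below is a simplified illustration, not the ERHARD-RNG implementation: entropy events from multiple sources (stand-ins here for SRAM startup state, oscillator jitter, and temperature) are spread round-robin across pools, and reseeding hashes together the pools selected by the reseed counter so that deeper pools accumulate more entropy before being consumed.

```python
# Simplified Fortuna-style entropy accumulation (illustrative, not ERHARD-RNG).
import hashlib

class FortunaLikeAccumulator:
    def __init__(self, num_pools=32):
        self.pools = [hashlib.sha256() for _ in range(num_pools)]
        self.event_count = 0
        self.reseed_count = 0
        self.key = b"\x00" * 32

    def add_event(self, source_id, data):
        # Distribute events round-robin over the pools, tagged by source.
        pool = self.pools[self.event_count % len(self.pools)]
        pool.update(bytes([source_id, len(data)]) + data)
        self.event_count += 1

    def reseed(self):
        self.reseed_count += 1
        material = self.key
        for i in range(len(self.pools)):
            # Pool i is consumed only every 2^i reseeds, so higher pools
            # gather more entropy between uses.
            if self.reseed_count % (1 << i) == 0:
                material += self.pools[i].digest()
                self.pools[i] = hashlib.sha256()
        self.key = hashlib.sha256(material).digest()
        return self.key

acc = FortunaLikeAccumulator()
acc.add_event(0, b"sram-startup-bits")          # stand-in for SRAM startup state
acc.add_event(1, (1234).to_bytes(4, "big"))     # stand-in for an oscillator jitter sample
acc.add_event(2, (25).to_bytes(2, "big"))       # stand-in for a temperature reading
print(acc.reseed().hex())                        # new generator key material
```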

Silhouette: Efficient Intra-Address Space Isolation for Protected Shadow Stacks on Embedded Systems

Jie Zhou, Yufei Du, Lele Ma, Zhuojia Shen, John Criswell, Robert J. Walls
arXiv:1910.12157

Embedded systems are increasingly deployed in devices that can have physical consequences if compromised by an attacker - including automobile control systems, smart locks, drones, and implantable medical devices. Due to resource and execution-time constraints, C is the primary programming language for writing both operating systems and bare-metal applications on these devices. Unfortunately, C is neither memory safe nor type safe. In this paper, we present an efficient intra-address space isolation technique for embedded ARM processors that leverages unprivileged store instructions. Using a compiler transformation, dubbed store promotion, which transforms regular stores into unprivileged equivalents, we can restrict a subset of the program's memory accesses to unprivileged memory while simultaneously allowing security instrumentation to access security-critical data structures (e.g., a shadow stack) - all without the need for expensive context switching. Using store promotion, we built Silhouette, a software defense that mitigates control-flow hijacking attacks using a combination of hardware configuration, runtime instrumentation, and code transformation. Silhouette enforces Control-Flow Integrity and provides an incorruptible shadow stack for return addresses. We implemented Silhouette on an ARMv7-M board and our evaluation shows that Silhouette incurs an arithmetic mean of 9.23% performance overhead and 16.93% code size overhead. Furthermore, we built Silhouette-Invert, an alternative implementation of Silhouette, which incurs just 2.54% performance overhead and 5.40% code size overhead, at the cost of a minor hardware change.

Account Lockouts: Characterizing and Preventing Account Denial-of-Service Attacks

Yu Liu, Matthew R. Squires, Curtis R. Taylor, Robert J. Walls, Craig A. Shue
Conference on Security and Privacy in Communication Networks (SecureComm)

To stymie password guessing attacks, many systems lock an account after a given number of failed authentication attempts, preventing access even if proper credentials are later provided. Combined with the proliferation of single sign-on providers, adversaries can use relatively few resources to launch large-scale application-level denial-of-service attacks against targeted user accounts by deliberately providing incorrect credentials across multiple authentication attempts. In this paper, we measure the extent to which this vulnerability exists in production systems. We focus on Microsoft services, which are used in many organizations, to identify exposed authentication points. We measure 2,066 organizations and found between 58% and 77% of organizations expose authentication portals that are vulnerable to account lockout attacks. Such attacks can be completely successful with only 13 KBytes/second of attack traffic. We then propose and evaluate a set of lockout bypass mechanisms for legitimate users. Our performance and security evaluation shows these solutions are effective while introducing little overhead to the network and systems.

Characterizing the Deep Neural Networks Inference Performance of Mobile Applications

Samuel Ogden, Tian Guo
arXiv:1909.04783

Today's mobile applications are increasingly leveraging deep neural networks to provide novel features, such as image and speech recognition. To use a pre-trained deep neural network, mobile developers can either host it in a cloud server, referred to as cloud-based inference, or ship it with their mobile application, referred to as on-device inference. In this work, we investigate the inference performance of these two common approaches on both mobile devices and public clouds, using popular convolutional neural networks. Our measurement study suggests the need for both on-device and cloud-based inferences for supporting mobile applications. In particular, newer mobile devices are able to run mobile-optimized CNN models in reasonable time. However, for older mobile devices or to use more complex CNN models, mobile applications should opt for cloud-based inference. We further demonstrate that variable network conditions can lead to poor cloud-based inference end-to-end time. To support efficient cloud-based inference, we propose a CNN model selection algorithm called CNNSelect that dynamically selects the most appropriate CNN model for each inference request, and adapts its selection to match different SLAs and execution time budgets that are caused by variable mobile environments. The key idea of CNNSelect is to make inference speed and accuracy trade-offs at runtime using a set of CNN models. We demonstrated that CNNSelect smoothly improves inference accuracy while maintaining SLA attainment in 88.5% more cases than a greedy baseline.
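The latency-aware selection idea can be illustrated with a short sketch. This is not the CNNSelect algorithm itself; it is a hedged Python example, with made-up model profiles, of the general policy of picking the most accurate model whose profiled latency fits within the request's remaining time budget.

```python
# General idea only (not CNNSelect): choose the most accurate CNN whose profiled
# latency fits within the remaining budget for this request. Profiles are
# illustrative assumptions, not measurements from the paper.
MODEL_PROFILES = [
    # (model name, expected inference latency in ms, top-1 accuracy)
    ("mobilenet_v1", 30, 0.70),
    ("inception_v3", 90, 0.78),
    ("resnet_152", 180, 0.79),
]

def select_model(sla_ms, network_delay_ms, margin_ms=10):
    budget = sla_ms - network_delay_ms - margin_ms
    feasible = [m for m in MODEL_PROFILES if m[1] <= budget]
    if not feasible:
        return MODEL_PROFILES[0][0]                  # fall back to the fastest model
    return max(feasible, key=lambda m: m[2])[0]      # most accurate feasible model

# A 250 ms SLA under a fast campus network vs. a slower residential network.
print(select_model(sla_ms=250, network_delay_ms=40))    # roomy budget -> larger model
print(select_model(sla_ms=250, network_delay_ms=200))   # tight budget -> small model
```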

ModiPick: SLA-aware Accuracy Optimization For Mobile Deep Inference

Samuel S. Ogden, Tian Guo
arXiv:1909.02053

Mobile applications are increasingly leveraging complex deep learning models to deliver features, e.g., image recognition, that require high prediction accuracy. Such models can be both computation and memory-intensive, even for newer mobile devices, and are therefore commonly hosted in powerful remote servers. However, current cloud-based inference services employ a static model selection approach that can be suboptimal for satisfying application SLAs (service level agreements), as they fail to account for the inherently dynamic mobile environment. We introduce a cloud-based technique called ModiPick that dynamically selects the most appropriate model for each inference request, and adapts its selection to match different SLAs and execution time budgets that are caused by variable mobile environments. The key idea of ModiPick is to make inference speed and accuracy trade-offs at runtime with a pool of managed deep learning models. As such, ModiPick masks unpredictable inference time budgets and therefore meets SLA targets, while improving accuracy within mobile network constraints. We evaluate ModiPick through experiments based on prototype systems and through simulations. We show that ModiPick achieves comparable inference accuracy to a greedy approach while improving SLA adherence by up to 88.5%.

Confidential Deep Learning: Executing Proprietary Models on Untrusted Devices

Peter M. VanNostrand, Ioannis Kyriazis, Michelle Cheng, Tian Guo, Robert J. Walls
arXiv:1908.10730v1

Performing deep learning on end-user devices provides fast offline inference results and can help protect the user’s privacy. However, running models on untrusted client devices reveals model information which may be proprietary, i.e., the operating system or other applications on end-user devices may be manipulated to copy and redistribute this information, infringing on the model provider’s intellectual property. We propose the use of ARM TrustZone, a hardware-based security feature present in most phones, to confidentially run a proprietary model on an untrusted end-user device. We explore the limitations and design challenges of using TrustZone and examine potential approaches for confidential deep learning within this environment. Of particular interest is providing robust protection of proprietary model information while minimizing total performance overhead.

Poster: Virtual Reality Streaming at the Edge: A Power Perspective

Zichen Zhu, Nan Jiang, Tian Guo, and Sheng Wei
ACM/IEEE Symposium on Edge Computing (SEC 2019)

This poster focuses on addressing the power consumption issues in 360-degree immersive video streaming on smartphones, an emerging virtual reality (VR) application in the consumer video market. We first conducted a power measurement study that indicates VR view generation as the major power consumption source. Then, we developed an edge-based immersive streaming system called EdgeVR that offloads the power-consuming view generation operation from the smartphone to the edge. Through our preliminary evaluations using EdgeVR, we identified the challenge of Motion-to-Photon latency associated with offloading. To reduce such delay, we propose a viewport prediction-based pre-rendering mechanism at the edge, thereby ensuring the quality of experience in the VR application.

Poster: EdgeServe: Efficient Deep Learning Model Caching at the Edge

Tian Guo, Robert J. Walls, Samuel S. Ogden
ACM/IEEE Symposium on Edge Computing (SEC 2019)

In this work, we look at how to effectively manage and utilize these deep learning models at each edge location, to provide performance guarantees to inference requests. We identify challenges in using these deep learning models at resource-constrained edge locations, and propose to adapt existing cache algorithms to effectively manage these deep learning models.

Presentation: Silhouette: Efficient Intra-Address Space Isolation for Protected Shadow Stacks on Embedded Systems

Jie Zhou, Yufei Du, Lele Ma, Zhuojia Shen, John Criswell, Robert J. Walls

Presentation: Confidential Deep Learning: Executing Proprietary Models on Untrusted Devices

Peter M. VanNostrand, Ioannis Kyriazis, Michelle Cheng, Tian Guo, Robert J. Walls
Great Lakes Security Day 2019

Performing machine learning on client devices is desirable as it provides fast, offline inference results and can protect the user's privacy. However, running models on untrusted client devices reveals information about the model such as structure and neuron weights which may be proprietary. As users have full access to the hardware and software of their devices, the client operating system or other applications may be manipulated to copy and redistribute this information, infringing on the model provider's intellectual property. We propose the use of ARM TrustZone, a hardware security module present in most phones, to provide a trusted environment for the execution of machine learning models. Outside the trusted execution environment, all model information would be kept encrypted to ensure model confidentiality. We explore the limitations and design challenges of using ARM TrustZone and examine potential approaches for confidentially performing deep learning within this environment. Of particular interest is providing robust protection of proprietary model information while minimizing total performance overhead.

CloudCoaster: Transient-aware Bursty Datacenter Workload Scheduling

Samuel S. Ogden, Tian Guo
arXiv:1907.02162

Today's clusters often have to divide resources among a diverse set of jobs. These jobs are heterogeneous both in execution time and in their rate of arrival. Execution time heterogeneity has led to the development of hybrid schedulers that can schedule both short and long jobs to ensure good task placement. However, arrival rate heterogeneity, or burstiness, remains a problem in existing schedulers. These hybrid schedulers manage resources on statically provisioned clusters, which can quickly be overwhelmed by bursts in the number of arriving jobs.
In this paper we propose CloudCoaster, a hybrid scheduler that dynamically resizes the cluster by leveraging cheap transient servers. CloudCoaster schedules jobs in an intelligent way that increases job performance while reducing overall resource cost. We evaluate the effectiveness of CloudCoaster through simulations on real-world traces and compare it against a state-of-the-art hybrid scheduler. CloudCoaster improves the average queueing delay time of short jobs by 4.8X while maintaining long job performance. In addition, CloudCoaster reduces the short partition budget by over 29.5%.

Control-Flow Integrity for Real-Time Embedded Systems

Robert J. Walls, Nicholas F. Brown, Thomas Le Baron, Craig A. Shue, Hamed Okhravi, Bryan C. Ward
31st Euromicro Conference on Real-Time Systems (ECRTS 2019)

Attacks on real-time embedded systems can endanger lives and critical infrastructure. Despite this, techniques for securing embedded systems software have not been widely studied. Many existing security techniques for general-purpose computers rely on assumptions that do not hold in the embedded case. This paper focuses on one such technique, control-flow integrity (CFI), that has been vetted as an effective countermeasure against control-flow hijacking attacks on general-purpose computing systems. Without the process isolation and fine-grained memory protections provided by a general-purpose computer with a rich operating system, CFI cannot provide any security guarantees. This work proposes RECFISH, a system for providing CFI guarantees on ARM Cortex-R devices running minimal real-time operating systems. We provide techniques for protecting runtime structures, isolating processes, and instrumenting compiled ARM binaries with CFI protection. We empirically evaluate RECFISH and its performance implications for real-time systems. Our results suggest RECFISH can be directly applied to binaries without compromising real-time performance; in a test of over six million realistic task systems running FreeRTOS, 85% were still schedulable after adding RECFISH.

Presentation: A Random Number Generator Built from Repurposed Hardware in Embedded Systems

Jacob T. Grycel and Robert J. Walls
New England Security Day 2019

Speeding up Deep Learning with Transient Servers

Shijian Li, Robert J. Walls, Lijie Xu, Tian Guo
The 16th IEEE International Conference on Autonomic Computing (ICAC'19), arXiv:1903.00045

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable—e.g., for rapidly evaluating new model designs—they often come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to get the benefits of distributed training without the high costs.
We conduct the first large-scale empirical analysis, launching more than a thousand GPU servers of various capacities, aimed at understanding the characteristics of transient GPU servers and their impact on distributed training performance. Our study demonstrates the potential of transient servers with a speedup of 7.7X with more than 62.9% monetary savings for some cluster configurations. We also identify a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware. For example, the dynamic cost and availability characteristics of transient servers suggest the need for frameworks to dynamically change cluster configurations to best take advantage of current conditions.

An Experimental Evaluation of Garbage Collectors on Big Data Applications

Lijie Xu, Tian Guo, Wensheng Dou, Wei Wang, and Jun Wei
The 45th International Conference on Very Large Data Bases (VLDB'19)

Popular big data frameworks, ranging from Hadoop MapReduce to Spark, all rely on garbage-collected languages, such as Java and Scala. Big data applications are especially sensitive to the effectiveness of garbage collection (i.e., GC), because they usually process a large number of data objects that lead to heavy GC overhead. Lacking in-depth understanding of GC performance has impeded performance improvement in big data applications. In this paper, we conduct a comprehensive evaluation on three popular garbage collectors, i.e., Parallel, CMS, and G1, using four representative Spark applications. By thoroughly investigating the correlation between these big data applications’ memory usage patterns and the collectors’ GC patterns, we obtain many findings about GC inefficiencies. We further propose empirical guidelines for application developers, and insightful optimization strategies for designing bigdata-friendly garbage collectors.

Cloud-based or On-device: An Empirical Study of Mobile Deep Inference

Tian Guo
2018 IEEE International Conference on Cloud Engineering (IC2E'18)

Modern mobile applications benefit significantly from the advancement in deep learning, e.g., implementing real-time image recognition and conversational system. Given a trained deep learning model, applications usually need to perform a series of matrix operations based on the input data, in order to infer possible output values. Because of computational complexity and size constraints, these trained models are often hosted in the cloud. When utilizing these cloud-based models, mobile apps will have to send input data over the network. While cloud-based deep learning can provide reasonable response time for mobile apps, it also restricts the use case scenarios, e.g., mobile apps need to have network access. With mobile specific deep learning optimizations, it is now possible to employ on-device inference. However, because mobile hardware, e.g., GPU and memory size, can be very limited when compared to its desktop counterpart, it is important to understand the feasibility of this new on-device deep learning inference architecture. In this paper, we empirically evaluate the inference efficiency of three Convolutional Neural Networks using a benchmark Android application we developed. Our measurement and analysis suggest that on-device inference can cost up to two orders of magnitude greater response time and energy when compared to cloud-based inference, and that loading model and computing probability are two performance bottlenecks for on-device deep inferences.

MODI: Mobile Deep Inference Made Efficient by Edge Computing

Samuel S. Ogden, Tian Guo
The USENIX Workshop on Hot Topics in Edge Computing (HotEdge '18)

In this paper, we propose a novel mobile deep inference platform, MODI, that delivers good inference performance. MODI improves the performance of deep learning powered mobile applications with optimizations in three complementary aspects. First, MODI provides a number of models and dynamically selects the best one during runtime. Second, MODI extends the set of models each mobile application can use by storing high quality models at the edge servers. Third, MODI manages a centralized model repository and periodically updates models at edge locations, ensuring up-to-date models for mobile applications without incurring high network latency. Our evaluation demonstrates the feasibility of trading off inference accuracy for improved inference speed, as well as the acceptable performance of edge-based inference.

Towards Efficient Deep Inference for Mobile Applications

Tian Guo
arXiv:1707.04610

Modern mobile applications are benefiting significantly from the advancement in deep learning, e.g., implementing real-time image recognition and conversational system. Given a trained deep learning model, applications usually need to perform a series of matrix operations based on the input data, in order to infer possible output values. Because of computational complexity and size constraints, these trained models are often hosted in the cloud. To utilize these cloud-based models, mobile apps will have to send input data over the network. While cloud-based deep learning can provide reasonable response time for mobile apps, it restricts the use case scenarios, e.g. mobile apps need to have network access. With mobile specific deep learning optimizations, it is now possible to employ on-device inference. However, because mobile hardware, such as GPU and memory size, can be very limited when compared to its desktop counterpart, it is important to understand the feasibility of this new on-device deep learning inference architecture. In this paper, we empirically evaluate the inference performance of three Convolutional Neural Networks (CNNs) using a benchmark Android application we developed. Our measurement and analysis suggest that on-device inference can cost up to two orders of magnitude greater response time and energy when compared to cloud-based inference, and that loading model and computing probability are two performance bottlenecks for on-device deep inferences.

Managing Risk in a Derivative IaaS Cloud

Prateek Sharma, Stephen Lee, Tian Guo, David Irwin, and Prashant Shenoy
IEEE Transactions on Parallel and Distributed Systems (TPDS'17)

Infrastructure-as-a-Service (IaaS) cloud platforms rent computing resources with different cost and availability tradeoffs. For example, users may acquire virtual machines (VMs) in the spot market that are cheap, but can be unilaterally terminated by the cloud operator. Because of this revocation risk, spot servers have been conventionally used for delay- and risk-tolerant batch jobs. In this paper, we develop risk mitigation policies which allow even interactive applications to run on spot servers. Our system, SpotCheck, is a derivative cloud platform that provides the illusion of an IaaS platform that offers always-available VMs on demand for a cost near that of spot servers, and supports unmodified applications. SpotCheck’s design combines virtualization-based mechanisms for fault-tolerance, and bidding and server selection policies for managing the risk and cost. We implement SpotCheck on EC2 and show that it i) provides nested VMs with 99.9989% availability, ii) achieves nearly 5× cost savings compared to using on-demand VMs, and iii) eliminates any risk of losing VM state.

Latency-aware Virtual Desktops Optimization in Distributed Clouds

Tian Guo, Prashant Shenoy, K. K. Ramakrishnan, and Vijay Gopalakrishnan
Multimedia Systems (MMSJ'17)

Distributed clouds offer a choice of data center locations for providers to host their applications. In this paper we consider distributed clouds that host virtual desktops which are then accessed by users through remote desktop protocols. Virtual desktops have different levels of latency-sensitivity, primarily determined by the actual applications running and affected by the end users’ locations. In the scenario of mobile users, even switching between 3G and WiFi networks affects the latency sensitivity. We design VMShadow, a system to automatically optimize the location and performance of latency-sensitive VMs in the cloud. VMShadow performs black-box fingerprinting of a VM’s network traffic to infer the latency-sensitivity and employs both ILP and greedy heuristic based algorithms to move highly latency-sensitive VMs to cloud sites that are closer to their end users. VMShadow employs a WAN-based live migration and a new network connection migration protocol to ensure that the VM migration and subsequent changes to the VM’s network address are transparent to end-users. We implement a prototype of VMShadow in a nested hypervisor and demonstrate its effectiveness for optimizing the performance of VM-based desktops in the cloud. Our experiments on a private as well as the public EC2 cloud show that VMShadow is able to discriminate between latency-sensitive and insensitive desktop VMs and judiciously moves only those that will benefit the most from the migration. For desktop VMs with video activity, VMShadow improves VNC’s refresh rate by 90% by migrating the virtual desktop to a closer location. Transcontinental remote desktop migrations only take about 4 minutes and our connection migration proxy imposes 13µs overhead per packet.

Performance and Cost Considerations for Providing Geo-Elasticity in Database Clouds

Tian Guo, and Prashant Shenoy
Transactions on Autonomous and Adaptive Systems (TAAS'17)

Online applications that serve a global workload have become the norm, and those applications experience not only temporal but also spatial workload variations. In addition, more applications are hosting their backend tiers separately for benefits such as ease of management. To provision for such applications, traditional elasticity approaches that only consider temporal workload dynamics and assume well-provisioned backends are insufficient. Instead, in this paper, we propose a new type of provisioning mechanism---geo-elasticity---by utilizing distributed clouds with different locations. Centered around this idea, we build a system called DBScale that tracks geographic variations in the workload to dynamically provision database replicas at different cloud locations across the globe. Our geo-elastic provisioning approach comprises a regression-based model that infers database query workload from spatially distributed front-end workload, a two-node open queueing network model that estimates the capacity of databases serving both CPU and I/O-intensive query workloads, and greedy algorithms for selecting the best cloud locations based on latency and cost. We implement a prototype of our DBScale system on Amazon EC2’s distributed cloud. Our experiments with our prototype show up to a 66% improvement in response time when compared to local elasticity approaches.
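As a rough illustration of what a two-node open queueing network can estimate (the model in the paper may differ in its assumptions and calibration), treating the database's CPU and I/O subsystems as two independent queues with utilization-dependent delays gives an end-to-end response time and a capacity bound of the following form; the symbols here are assumptions for this sketch: λ is the query arrival rate, S_cpu and S_io are mean per-visit service times, and V_io is the mean number of I/O visits per query.

```latex
% Sketch under simplifying M/M/1-style assumptions, not necessarily DBScale's exact model.
R(\lambda) = \frac{S_{\mathrm{cpu}}}{1 - \lambda S_{\mathrm{cpu}}}
           + \frac{V_{\mathrm{io}}\, S_{\mathrm{io}}}{1 - \lambda V_{\mathrm{io}} S_{\mathrm{io}}},
\qquad
\text{capacity} = \max\{\lambda : R(\lambda) \le R_{\mathrm{SLA}}\}.
```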

On the Feasibility of Cloud-Based SDN Controllers for Residential Networks

Curtis R. Taylor, Tian Guo, Craig A. Shue, and Mohamed E. Najd
2017 IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN'17)

Residential networks are home to increasingly diverse devices, including embedded devices that are part of the Internet of Things phenomenon, leading to new management and security challenges. However, current residential solutions that rely on customer premises equipment (CPE), which often remains deployed in homes for years without updates or maintenance, are not evolving to keep up with these emerging demands. Recently, researchers have proposed to outsource the tasks of managing and securing residential networks to cloud-based security services by leveraging software-defined networking (SDN). However, the use of cloud-based infrastructure may have performance implications. In this paper, we measure the performance impact and perception of a residential SDN using a cloud-based controller through two measurement studies. First, we recruit 270 residential users located across the United States to measure residential latency to cloud providers. Our measurements suggest the cloud controller architecture provides 90% of end-users with acceptable performance with judiciously selected public cloud locations. When evaluating web page loading times of popular domains, which are particularly latency-sensitive, we found an increase of a few seconds at the median. However, optimizations could reduce this overhead for top websites in practice.

Providing Geo-Elasticity in Geographically Distributed Clouds

Tian Guo, Prashant Shenoy
ACM Transactions on Internet Technology (TOIT'17)

Geographically distributed cloud platforms are well suited for serving a geographically diverse user base. However, traditional cloud provisioning mechanisms that make local scaling decisions are not adequate for delivering the best possible performance for modern web applications that observe both temporal and spatial workload fluctuations. In this paper, we propose GeoScale, a system that provides geo-elasticity by combining model-driven proactive and agile reactive provisioning approaches. GeoScale can dynamically provision server capacity at any location based on workload dynamics. We conduct a detailed evaluation of GeoScale on Amazon’s geo-distributed cloud, and show up to 40% improvement in the 95th percentile response time when compared to traditional elasticity techniques.

Elastic Resource Management in Distributed Clouds

Tian Guo
Ph.D. thesis, University of Massachusetts Amherst.

The ubiquitous nature of computing devices and their increasing reliance on remote resources have driven and shaped public cloud platforms into unprecedented large-scale, distributed data centers. Concurrently, a plethora of cloud-based applications are experiencing multi-dimensional workload dynamics—workload volumes that vary along both time and space axes and with higher frequency. The interplay of diverse workload characteristics and distributed clouds raises several key challenges for efficiently and dynamically managing server resources. First, current cloud platforms impose certain restrictions that might hinder some resource management tasks. Second, an application-agnostic approach might not entail appropriate performance goals and therefore requires numerous specific methods. Third, provisioning resources outside the LAN boundary might incur huge delays, which would impact the desired agility. In this dissertation, I investigate the above challenges and present the design of automated systems that manage resources for various applications in distributed clouds. The intermediate goal of these automated systems is to fully exploit potential benefits such as reduced network latency offered by increasingly distributed server resources. The ultimate goal is to improve end-to-end user response time with novel resource management approaches, within a certain cost budget. Centered around these two goals, I first investigate how to optimize the location and performance of virtual machines in distributed clouds. I use virtual desktops, mostly serving a single user, as an example use case for developing a black-box approach that ranks virtual machines based on their dynamic latency requirements. Those with high latency sensitivities have a higher priority of being placed or migrated to a cloud location closest to their users. Next, I relax the assumption of well-provisioned virtual machines and look at how to provision enough resources for applications that exhibit both temporal and spatial workload fluctuations. I propose an application-agnostic queueing model that captures the resource utilization and server response time. Building upon this model, I present a geo-elastic provisioning approach—referred to as geo-elasticity—for replicable multi-tier applications that can spin up an appropriate amount of server resources in any cloud location. Last, I explore the benefits of providing geo-elasticity for database clouds, a popular platform for hosting application backends. Performing geo-elastic provisioning for backend database servers entails several challenges that are specific to database workload, and therefore requires tailored solutions. In addition, cloud platforms offer resources at various prices for different locations. Towards this end, I propose a cost-aware geo-elasticity that combines a regression-based workload model and a queueing network capacity model for database clouds. In summary, hosting a diverse set of applications in an increasingly distributed cloud makes it interesting and necessary to develop new, efficient and dynamic resource management approaches.

Analyzing the Efficiency of a Green University Data Center

Patrick Pegus II, Benoy Varghese, Tian Guo, David Irwin, Prashant Shenoy, Anirban Mahanti, James Culbert, John Goodhue, Chris Hill
Proceedings of the 2016 ACM International Conference on Performance Engineering (ICPE'16)

Data centers are an indispensable part of today’s IT infrastructure. To keep pace with modern computing needs, data centers continue to grow in scale and consume increasing amounts of power. While prior work on data centers has led to significant improvements in their energy-efficiency, detailed measurements from these facilities’ operations are not widely available, as data center design is often considered part of a company’s competitive advantage. However, such detailed measurements are critical to the research community in motivating and evaluating new energy-efficiency optimizations. In this paper, we present a detailed analysis of a state-of-the-art 15MW green multi-tenant data center that incorporates many of the technological advances used in commercial data centers. We analyze the data center’s computing load and its impact on power, water, and carbon usage using standard effectiveness metrics, including PUE, WUE, and CUE. Our results reveal the benefits of optimizations, such as free cooling, and provide insights into how the various effectiveness metrics change with the seasons and increasing capacity usage. More broadly, our PUE, WUE, and CUE analysis validate the green design of this LEED Platinum data center.

GeoScale: Providing Geo-Elasticity in Distributed Clouds

Tian Guo, Prashant Shenoy, Hakan Hacigumus
Proceedings of the 2016 IEEE International Conference on Cloud Engineering (IC2E'16)

Distributed cloud platforms are well suited for serving a geographically diverse user base. However, traditional cloud provisioning mechanisms that make local scaling decisions are not well suited for the temporal and spatial workload fluctuations seen by modern web applications. In this paper, we argue for the need for geo-elasticity and present GeoScale, a system to provide geo-elasticity in distributed clouds. We describe GeoScale’s model-driven proactive provisioning approach and conduct an initial evaluation of GeoScale on Amazon’s distributed EC2 cloud. Our results show up to 31% improvement in the 95th percentile response time when compared to traditional elasticity techniques.

Flint: Batch-Interactive Data-Intensive Processing on Transient Servers

Prateek Sharma, Tian Guo, Xin He, David Irwin, Prashant Shenoy
Proceedings of the Eleventh European Conference on Computer Systems (EuroSys'16)

Cloud providers now offer transient servers, which they may revoke at any time, for significantly lower prices than on-demand servers, which they cannot revoke. Transient servers’ low price is particularly attractive for executing an emerging class of workload, which we call Batch-Interactive Data-Intensive (BIDI), that is becoming increasingly important for data analytics. BIDI workloads require large sets of servers to cache massive datasets in memory to enable low latency operation. In this paper, we illustrate the challenges of executing BIDI workloads on transient servers, where revocations (akin to failures) are the common case. To address these challenges, we design Flint, which is based on Spark and includes automated checkpointing and server selection policies that i) support batch and interactive applications and ii) dynamically adapt to application characteristics. We evaluate a prototype of Flint using EC2 spot instances, and show that it yields cost savings of up to 90% compared to using on-demand servers, while increasing running time by < 2%.

Placement Strategies for Virtualized Network Functions in a NFaaS Cloud

Xin He, Tian Guo, Erich Nahum and Prashant Shenoy
Fourth IEEE Workshop on Hot Topics in Web Systems and Technologies (HotWeb'16)

Enterprises that host services in the cloud need to protect their cloud resources using network services such as firewalls and deep packet inspection systems. While middleboxes have typically been used to implement such network functions in traditional enterprise networks, their use in cloud environments by cloud tenants is problematic due to the boundary between cloud providers and cloud tenants. Instead, we argue that network function virtualization is a natural fit in cloud environments, where the cloud provider can implement Network Functions as a Service using virtualized network functions running on cloud servers, and enterprise cloud tenants can employ these services to implement security and performance optimizations for their cloud resources. In this paper, we focus on placement issues in the design of an NFaaS cloud and present two placement strategies---tenant-centric and service-centric---for deploying virtualized network services in multi-tenant settings. We discuss several trade-offs of these two strategies. We implement a prototype NFaaS testbed and conduct a series of experiments to quantify the benefits and drawbacks of our two strategies. Our results suggest that the tenant-centric placement provides lower latencies while the service-centric approach is more flexible for reconfiguration and capacity scaling.

SpotCheck: Designing a Derivative IaaS Cloud on the Spot Market

Prateek Sharma, Stephen Lee, Tian Guo, David Irwin, and Prashant Shenoy
Proceedings of the Tenth European Conference on Computer Systems (EuroSys'15)

Infrastructure-as-a-Service (IaaS) cloud platforms rent resources, in the form of virtual machines (VMs), under a variety of contract terms that offer different levels of risk and cost. For example, users may acquire VMs in the spot market that are often cheap but entail significant risk, since their price varies over time based on market supply and demand and they may terminate at any time if the price rises too high. Currently, users must manage all the risks associated with using spot servers. As a result, conventional wisdom holds that spot servers are only appropriate for delay-tolerant batch applications. In this paper, we propose a derivative cloud platform, called SpotCheck, that transparently manages the risks associated with using spot servers for users. SpotCheck provides the illusion of an IaaS platform that offers always-available VMs on demand for a cost near that of spot servers, and supports all types of applications, including interactive ones. SpotCheck’s design combines the use of nested VMs with live bounded-time migration and novel server pool management policies to maximize availability, while balancing risk and cost. We implement SpotCheck on Amazon’s EC2 and show that it i) provides nested VMs to users that are 99.9989% available, ii) achieves nearly 5X cost savings compared to using equivalent types of on-demand VMs, and iii) eliminates any risk of losing VM state.

Model-driven Geo-Elasticity In Database Clouds

Tian Guo and Prashant Shenoy
International Conference on Autonomic Computing and Communications (ICAC'15)

Motivated by the emergence of distributed clouds, we argue for the need for geo-elastic provisioning of application replicas to effectively handle temporal and spatial workload fluctuations seen by such applications. We present DBScale, a system that tracks geographic variations in the workload to dynamically provision database replicas at different cloud locations across the globe. Our geo-elastic provisioning approach comprises a regression-based model to infer the database query workload from observations of the spatially distributed frontend workload and a two-node open queueing network model to provision databases with both CPU and I/O-intensive query workloads. We implement a prototype of our DBScale system on Amazon EC2’s distributed cloud. Our experiments with our prototype show up to a 66% improvement in response time when compared to local elasticity approaches.

SpotOn: A Batch Computing Service for the Spot Market

Supreeth Subramanya, Tian Guo, Prateek Sharma, David Irwin, and Prashant Shenoy
Proceedings of the 6th Annual Symposium on Cloud Computing (SoCC'15)

Cloud spot markets enable users to bid for compute resources, such that the cloud platform may revoke them if the market price rises too high. Due to their increased risk, revocable resources in the spot market are often significantly cheaper (by as much as 10X) than the equivalent non-revocable on-demand resources. One way to mitigate spot market risk is to use various fault-tolerance mechanisms, such as checkpointing or replication, to limit the work lost on revocation. However, the additional performance overhead and cost for a particular fault-tolerance mechanism is a complex function of both an application’s resource usage and the magnitude and volatility of spot market prices. We present the design of a batch computing service for the spot market, called SpotOn, that automatically selects a spot market and fault-tolerance mechanism to mitigate the impact of spot revocations without requiring application modification. SpotOn’s goal is to execute jobs with the performance of on-demand resources, but at a cost near that of the spot market. We implement and evaluate SpotOn in simulation and using a prototype on Amazon’s EC2 that packages jobs in Linux Containers. Our simulation results using a job trace from a Google cluster indicate that SpotOn lowers costs by 91.9% compared to using on-demand resources with little impact on performance.

Cost-Aware Cloud Bursting for Enterprise Applications

Tian Guo, Upendra Sharma, Prashant Shenoy, Timothy Wood, and Sambit Sahu
ACM Transactions on Internet Technology (TOIT'14)

The high cost of provisioning resources to meet peak application demands has led to the widespread adoption of pay-as-you-go cloud computing services to handle workload fluctuations. Some enterprises with existing IT infrastructure employ a hybrid cloud model where the enterprise uses its own private resources for the majority of its computing, but then “bursts” into the cloud when local resources are insufficient. However, current commercial tools rely heavily on the system administrator’s knowledge to answer key questions such as when a cloud burst is needed and which applications must be moved to the cloud. In this paper we describe Seagull, a system designed to facilitate cloud bursting by determining which applications should be transitioned into the cloud and automating the movement process at the proper time. Seagull optimizes the bursting of applications using an optimization algorithm as well as a more efficient but approximate greedy heuristic. Seagull also optimizes the overhead of deploying applications into the cloud using an intelligent precopying mechanism that proactively replicates virtualized applications, lowering the bursting time from hours to minutes. Our evaluation shows over 100% improvement compared to naive solutions, though our greedy heuristic produces more expensive solutions compared to ILP. However, the scalability of our greedy algorithm is dramatically better as the number of VMs increases. Our evaluation illustrates scenarios where our prototype can reduce cloud costs by more than 45% when bursting to the cloud, and that the incremental cost added by precopying applications is offset by a burst time reduction of nearly 95%.

VMShadow: Optimizing the Performance of Latency-sensitive Virtual Desktops in Distributed Clouds

Tian Guo, Vijay Gopalakrishnan, K. K. Ramakrishnan, Prashant Shenoy, Arun Venkataramani, and Seungjoon Lee
Proceedings of the 5th ACM Multimedia Systems Conference (MMSys'14)

Distributed clouds offer a choice of data center locations to application providers to host their applications. In this paper we consider distributed clouds that host virtual desktops which are then accessed by their users through remote desktop protocols. We argue that virtual desktops that run latency-sensitive applications such as games or video players are particularly sensitive to the choice of the cloud data center location. We design VMShadow, a system to automatically optimize the location and performance of location-sensitive virtual desktops in the cloud. VMShadow performs black-box fingerprinting of a VM’s network traffic to infer its location-sensitivity and employs a greedy heuristic based algorithm to move highly location-sensitive VMs to cloud sites that are closer to their end-users. VMShadow employs WAN-based live migration and a new network connection migration protocol to ensure that the VM migration and subsequent changes to the VM’s network address are transparent to end-users. We implement a prototype of VMShadow in a nested hypervisor and demonstrate its effectiveness for optimizing the performance of VM-based desktops in the cloud. Our experiments on a private and the public EC2 cloud show that VMShadow is able to discriminate between location-sensitive and insensitive desktop applications and judiciously move only those VMs that will benefit the most. For desktop VMs with video activity, VMShadow improves VNC’s refresh rate by 90%. Further our connection migration proxy, which utilizes dynamic rewriting of packet headers, imposes a rewriting overhead of only 13µs per packet. Trans-continental VM migrations take about 4 minutes.

VMShadow: Optimizing The Performance of Virtual Desktops in Distributed Clouds

Tian Guo, Vijay Gopalakrishnan, K. K. Ramakrishnan, Prashant Shenoy, Arun Venkataramani, and Seungjoon Lee
Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC '13)

We present VMShadow, a system that automatically optimizes the location and performance of applications based on their dynamic workloads. We prototype VMShadow and demonstrate its efficacy using VM-based desktops in the cloud as an example application. Our experiments on a private cloud as well as the EC2 cloud, using a nested hypervisor, show that VMShadow is able to discriminate between location-sensitive and location-insensitive desktop VMs and judiciously moves only those that will benefit the most from the migration. For example, VMShadow performs transcontinental VM migrations in ∼4 mins and can improve VNC’s video refresh rate by up to 90%.

Seagull: Intelligent Cloud Bursting for Enterprise Applications

Tian Guo, Upendra Sharma, Timothy Wood, Sambit Sahu, and Prashant Shenoy
Proceedings of the 2012 USENIX conference on Annual Technical Conference (ATC'12)

Enterprises with existing IT infrastructure are beginning to employ a hybrid cloud model where the enterprise uses its own private resources for the majority of its computing, but then “bursts” into the cloud when local resources are insufficient. However, current approaches to cloud bursting cannot be effectively automated because they heavily rely on system administrator knowledge to make decisions. In this paper we describe Seagull, a system designed to facilitate cloud bursting by determining which applications can be transitioned into the cloud most economically, and automating the movement process at the proper time. We further optimize the deployment of applications into the cloud using an intelligent precopying mechanism that proactively replicates virtualized applications, lowering the bursting time from hours to minutes. Our evaluation illustrates how our prototype can reduce cloud costs by more than 45% when bursting to the cloud, and the incremental cost added by precopying applications is offset by a burst time reduction of nearly 95%.