Summarization for big data sets

Summarization is a key feature to handle large volumes of data efficiently.
Several methods allow for a compact, summarized representation of large data sets, such as Bloom filters, hash trees, and locality-sensitive hashing.
A Bloom filter is a compact representation of a large data set.
It is a probabilistic data structure that can produce false positives; in contrast, it never produces false negatives.
Hence it is the ideal tool to efficiently decide whether a value is not contained in a large set.
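As a minimal sketch (class and parameters below are illustrative, not a production implementation), a Bloom filter can be built from a bit array and a few salted hash functions:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted hash functions over a bit array of size m."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, value):
        # Derive k bit positions from k salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False means "definitely not in the set"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(value))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
print(bf.might_contain("alice"))    # True -- no false negatives
print(bf.might_contain("mallory"))  # almost certainly False (false positives are possible)
```

The one-sided error is the point: a negative answer is definitive, so the filter can cheaply rule out lookups before touching the large data set itself.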
Hash trees allow for quick detection of whether two lists of documents are identical; if the lists differ, hash trees also enable quick identification of the changed parts.
Minhashing is a method to represent characteristics of large texts in a condensed form; minhashes can be combined with similarity measures and then applied for locality-sensitive hashing.
Applications of these techniques cover key-based data access, peer-to-peer message exchange and similarity-based search.
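The minhashing idea can likewise be sketched in a few lines: each of several salted hash functions contributes the minimum hash over a text's shingle set, and the fraction of matching signature slots estimates the Jaccard similarity of the two sets (function names and parameters are illustrative):

```python
import hashlib

def shingles(text, k=4):
    """Character k-grams ("shingles") of a text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """One minimum per salted hash function yields a compact signature."""
    return [min(int(hashlib.sha256(f"{i}:{s}".encode()).hexdigest(), 16)
                for s in items)
            for i in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
print(estimated_jaccard(a, b))  # close to the true Jaccard similarity of the shingle sets
```

For locality-sensitive hashing, such signatures are typically split into bands that are hashed to buckets, so that only texts colliding in at least one bucket need to be compared in full.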


Free online lecture on NoSQL and distributed databases

Dear colleagues,

You can now find a playlist of my screencasts on NoSQL databases on YouTube:

1 Intro and RDBMSs
2 Graph databases
3 XML databases
4 Key-value stores and document databases
5 Column stores
6 Extensible Record stores
7 Polyglot data management
8 Distributed data management

Moreover, all figures from the book “Advanced Data Management” are freely available on the web site:

Best regards
Lena Wiese

CfP: BigBIA@BTW’17

Call for Papers:
Big Data Management Systems in Business and Industrial Applications

[Co-located with the 17th Conference on Database Systems for Business, Technology, and Web]

Big Data stands for the intelligent and efficient handling and usage of large, heterogeneous and fast-changing amounts of data. The ultimate goal of big data is the generation of valuable insights from ever-growing amounts of data. Hence, the application of Big Data in business and industry contexts has proven to be very valuable. However, several challenges related to big data remain; these challenges go beyond the oft-cited catchphrases of volume, velocity, variety, and veracity, and also address security, privacy, and linked-data technologies in the context of a wide range of AI applications. Big Data is more than just data analysis, since the outcome influences the way digital businesses and a knowledge economy are organized now and in the future.

This workshop is dedicated to the application of Big Data concepts and technologies in real-world systems for a variety of application domains like Manufacturing, Logistics, Media, Healthcare, and Finance.
Topics of Interest

Topics of interest include, but are not limited to:

  • Tools and Technologies
    Big data technologies (MapReduce, NoSQL, InMemory, Parallel Data Processing)
    Applied security in big data contexts
    Tools for data mapping, data exploration and data cleaning
    Software architectures for Big Data and Data Analytics
    Development methodologies for Big Data Applications
    Technologies for the generation of big data (sensor networks)
    Stream processing architectures
  • Big Data and Data Analytics Applications in Business and Industry
    Process Control and Optimization
    Event-Driven Business Process Management
    Decision Support
    Preventive and Predictive Maintenance
    Energy management
    Industrial security
    Environmental compliance enforcement
    Social media analytics for market intelligence
  • Algorithms and Methods for Business and Industrial Analytics
    Predictive Analytics, Anomaly Detection, Machine Learning, Visual Analytics
    Data quality and cleaning
    Mining heterogeneous data
    Event detection in complex data
    Complex Event Processing
    Unsupervised and supervised learning from large amounts of data
    Self-optimization of complex industrial processes

Submission Guidelines

We welcome practical experience reports as well as theoretical analyses of novel approaches and algorithms. Full papers (4 to 8 pages) and short papers (2 to 3 pages) are solicited. Submissions can be written in German or English and must present original, unpublished work that is not under review elsewhere. Submissions must adhere to the LNI formatting guidelines of the BTW main conference and must be uploaded as PDF documents through the BTW online submission system (Workshop BigBIA17). Submissions will be reviewed in a single-blind process by the workshop program committee members. Accepted papers will be published in the GI Lecture Notes in Informatics (LNI).

Important Dates
06.11.2016    Submission of Contributions
04.12.2016    Author Notification
18.12.2016    Camera Ready
XX.03.2017    Workshop

New publication on data fragmentation

My coauthors and I just published a journal paper, “A Replication Scheme for Multiple Fragmentations with Overlapping Fragments”, in The Computer Journal (Oxford):

This paper addresses the efficient management of data in a distributed database system.
When large-scale data sets are stored in distributed database systems, they are usually fragmented (also called partitioned) into smaller subsets.
Moreover, to achieve better availability and failure tolerance, copies of the data sets (the so-called replicas) are created. Different replicas of the same data set should reside on distinct servers. A major challenge with data fragmentation and replication is to enable efficient query answering while retrieving data from several servers.

In addition to technical requirements of data distribution, intelligent query answering mechanisms are increasingly important to find relevant answers to user queries. Our approach clusters the data according to several so-called relaxation attributes in a base table.
In order to offer intelligent query answering with only a modest overhead, the paper describes a method to manage different fragmentations (optimized for different query intents) in parallel.
This method also includes an advanced replication scheme that takes advantage of overlapping fragments.
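As a hypothetical illustration of the general idea (not the paper's actual algorithm; the table, attribute values, and cluster names below are invented), rows can be grouped into fragments by clusters over a relaxation attribute; a value belonging to two clusters yields overlapping fragments, and each fragment is replicated on distinct servers:

```python
from collections import defaultdict

rows = [
    {"id": 1, "diagnosis": "flu"},
    {"id": 2, "diagnosis": "cough"},
    {"id": 3, "diagnosis": "asthma"},
]

# Clusters over the relaxation attribute; "cough" deliberately belongs to two
# clusters, which produces overlapping fragments.
clusters = {"respiratory-infection": {"flu", "cough"},
            "airway-disease": {"cough", "asthma"}}

fragments = defaultdict(list)
for row in rows:
    for name, values in clusters.items():
        if row["diagnosis"] in values:
            fragments[name].append(row["id"])

# Place two replicas of each fragment on distinct servers (round robin).
servers = ["s1", "s2", "s3"]
replicas = {name: [servers[(i + r) % len(servers)] for r in range(2)]
            for i, name in enumerate(fragments)}

print(dict(fragments))  # row 2 appears in both fragments
print(replicas)         # two distinct servers per fragment
```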


CfP: BTW 2017 Stuttgart

The conference Databases for Business, Technology and Web (BTW) will take place in Stuttgart from March 6 to 10, 2017.

For almost three decades, the BTW conference has been the central forum for the German-speaking database community.

The BTW conference addresses open challenges and possible solutions alike. Among the hot topics of the 2017 conference are: new hardware and memory technologies, information extraction, information integration, big data analytics, web data management, service-oriented computing, and cloud computing.



CfP: NoSQL-Net2016@DEXA (Porto, September 5 – 8)

Call for Papers:
3rd International Workshop on Emerging Database Technologies and Linked
Data Applications

in conjunction with DEXA 2016 in Porto, September 5 – 8, 2016.

In the last few years, NoSQL databases, main-memory databases and stream data management, as well as RDF-based technologies, have matured to the point that Gartner, Inc., the world’s leading information technology research and advisory company, has identified them as top technology trends that will play key roles in modernizing information management in 2013 and beyond. It is thus natural that an ‘Emerging Database Technologies and Linked Data Applications’ workshop is organized in conjunction with the DEXA conference, one of the oldest conferences for intelligent information management. As in the previous two years, the workshop welcomes submissions covering technological advances as well as industrial or scientific applications of novel data management technologies.

A major focus will be given this year to semantic technologies based on the Linked Data paradigm. A semantic approach to publishing and sharing information (based on the Linked Data paradigm) is already recognized and recommended in European strategic documents. RDF, the Resource Description Framework, is one of the key ingredients of Linked Data, and provides a generic graph-based data model for describing things, including their relationships with other things.

The new research challenges that will be addressed in this workshop are related to:
– the emerging methods and techniques leveraged in NoSQL databases,
– the emerging database technologies and applications (mobile databases, multimedia databases, geographical information systems, biological / bioinformatics databases, sensor data management),
– the emerging pan-European data infrastructure that will enable interoperability, interlinking and reuse of open data in public and also commercial services,
– the NoSQL advantages in Social networking and Semantic Web applications,
– the need for enhanced security in the forthcoming business models (e-government, e-commerce),
– the alternative technologies for solving different Big Data workload–related problems.

* Areas of interest include, but are not limited to:
– Schema design and schema evolution for Linked Data
– Linked Data tools (platforms / frameworks / services)
– Linked Data standard technologies
– Linked Data workflow (publication / consumption)
– Linked Data applications (e.g., e-Government, e-Environment, or e-Health)
– Quality and trustworthiness of Linked Data
– Interoperability between different knowledge organization schemas
– Efficient and effective processing and management of Big Data
– Advances in NoSQL databases (key/value, columnar, document)
– Advances in Graph databases
– Advances in In-Memory data management
– Map/Reduce framework and its exploitation
– Security mechanisms for NoSQL management

* Important dates:
Submission of Abstracts: March 31, 2016
Submission of Full Papers: April 10, 2016
Notification of Acceptance: April 30, 2016
Camera-Ready Version Due: May 31, 2016
Conference date: September 5-8, 2016
Workshop date: to be specified


Searchable Encryption in Apache Cassandra

Due to their flexible data model, column family databases (aka wide column stores) like Apache Cassandra and Apache HBase are offered as Database-as-a-Service – for example by cloud storage providers like Google Cloud Platform, Microsoft Azure, Amazon Web Services, or Rackspace.
In a project funded by the German Research Council (DFG), my research group investigates modern (so-called property-preserving) encryption schemes for this type of database.
Our recent results in [1] focus on searchable encryption schemes for Cassandra.

Customers can encrypt their data before sending them to the cloud to prevent access to their data by internal and external attackers:
“[…] outsourcing sensitive data to third party storage providers has always been a security risk, in the private sector (e.g. sharing of photos or health information, messaging) as well as in the business sector (e.g. classified documents or confidential mailing). Not only adversaries with physical access to data servers are potentially dangerous, (honest but) curious or malicious database administrators of hosting providers also may snoop on sensitive data and thereby pose a threat.” [1]

Traditional strong encryption schemes however limit the functionality of the databases. In particular, the database can no longer search for records matching a keyword provided by the cloud customer.

With a searchable encryption scheme, the cloud customer first encrypts the data and sends it to the cloud database. Later on, the customer sends an encrypted keyword to the cloud database, which then searches for records matching the encrypted keyword; these matches are returned to the customer and decrypted for further processing.

We implemented three approaches in a unified reference framework, which enables a direct comparison of the different schemes.
The following schemes were chosen:

The sequential scan approach (called SWP [2]) encrypts every word in a record separately (a fixed word length has to be selected); during search, every encrypted word is sequentially compared to the provided encrypted keyword.

The index-per-keyword approach (called CGK [3]) creates an encrypted index for each word in all records (linking the keyword to matching record identifiers).

The index-per-record approach (called HK [4]) creates a forward index that maps each document to the encrypted keywords it contains. A second backward index stores a history of search results to speed up repeated searches for the same keyword; it maps encrypted keywords to records.
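The index-per-keyword idea can be sketched with a toy inverted index. This is an illustration only, not the implementation from [1] or [3], and it is not secure as written (deterministic tokens leak search patterns, and the record payloads would additionally be encrypted in a real system); the key, records, and helper names are invented for the example:

```python
import hmac
import hashlib

KEY = b"client-secret-key"  # stays with the cloud customer

def token(keyword):
    # Deterministic keyword token; the server never sees the plaintext keyword.
    return hmac.new(KEY, keyword.encode(), hashlib.sha256).hexdigest()

# Client side: build the encrypted index before uploading.
records = {1: "meeting budget report", 2: "budget forecast", 3: "vacation photos"}
index = {}
for rid, text in records.items():
    for word in set(text.split()):
        index.setdefault(token(word), []).append(rid)

# Server side: look up matching record identifiers for a search token
# without ever learning the underlying keyword.
def search(index, trapdoor):
    return sorted(index.get(trapdoor, []))

print(search(index, token("budget")))  # [1, 2]
print(search(index, token("photos")))  # [3]
```

The encryption cost of building such an index up front is what buys the sublinear search time reported for the CGK scheme below.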

We tested the three implementations on a “distributed Cassandra Cluster consisting of two nodes, each equipped with a Intel Core i7 3770 CPU (@ 3.4GHz) and 16 GB RAM, running Ubuntu 14.04 LTS and Apache Cassandra 2.1.4.” [1]

“We employ the popular scenario of using searchable encryption for data in a mailbox. We use a subset of the TREC 2005 Spam Track Public Corpus [5]. We assume average mailbox sizes of 1,000 mails up to 10,000 mails.”

The first test runs measured performance of the encryption processes:

The “time needed for encryption grows linearly in all schemes. The HK scheme is the fastest with the SWP scheme being not significantly slower. Both schemes beat the CGK scheme roughly by a factor of 4.5.” [1]

The second test runs measured performance of the search processes:

“The high encryption effort of the CGK scheme pays off in sublinear search time (0.13 seconds when searching 10.000 mails). Due to its index […] only the HK scheme can be faster (constant search time), but only if searching for the same word again […]. It performs orders of magnitude worse when searching a word for the first time […]. Then it is almost as slow as the SWP scheme. Note that the SWP scheme as slowest one in our test still manages to search over half a million words per second.” [1]

[1]  Tim Waage, Ramaninder Singh Jhajj, Lena Wiese. Searchable Encryption in Apache Cassandra. In: Foundations and Practice of Security – 8th International Symposium, Lecture Notes in Computer Science. Springer, 2015.

[2] Song, D.X., Wagner, D., Perrig, A.: Practical techniques for searches on encrypted data. In: Proceedings of the 2000 IEEE Symposium on Security and Privacy, IEEE (2000) 44–55

[3] Curtmola, R., Garay, J., Kamara, S., Ostrovsky, R.: Searchable symmetric encryption: improved definitions and efficient constructions. In: Proceedings of the 13th ACM conference on Computer and communications security, ACM (2006) 79–88

[4] Hahn, F., Kerschbaum, F.: Searchable encryption with secure and efficient updates. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, ACM (2014) 310–320

[5] Available at