Matching Items (7)

134154-Thumbnail Image.png

A Benchmark for Automated Fact Checking with Knowledge Bases

Description

The need for automated / computational fact checking has grown substantially in recent times due to the high volume of false information and limited workforce of human fact checkers. This

The need for automated / computational fact checking has grown substantially in recent times due to the high volume of false information and limited workforce of human fact checkers. This need has spawned research and new developments in this field and has created many different systems and approaches to this complex problem. This paper attempts to not just explain the most popular methods that are currently being used, but provide experimental results of the comparison of two different systems, the replication of results from their respective papers, and an annotated data-set of different test sentences to be used in these systems.

Contributors

Agent

Created

Date Created
  • 2017-12

149668-Thumbnail Image.png

Privacy preserving service discovery and ranking for multiple user QoS requirements in service-based software systems

Description

Service based software (SBS) systems are software systems consisting of services based on the service oriented architecture (SOA). Each service in SBS systems provides partial functionalities and collaborates with other

Service based software (SBS) systems are software systems consisting of services based on the service oriented architecture (SOA). Each service in SBS systems provides partial functionalities and collaborates with other services as workflows to provide the functionalities required by the systems. These services may be developed and/or owned by different entities and physically distributed across the Internet. Compared with traditional software system components which are usually specifically designed for the target systems and bound tightly, the interfaces of services and their communication protocols are standardized, which allow SBS systems to support late binding, provide better interoperability, better flexibility in dynamic business logics, and higher fault tolerance. The development process of SBS systems can be divided to three major phases: 1) SBS specification, 2) service discovery and matching, and 3) service composition and workflow execution. This dissertation focuses on the second phase, and presents a privacy preserving service discovery and ranking approach for multiple user QoS requirements. This approach helps service providers to register services and service users to search services through public, but untrusted service directories with the protection of their privacy against the service directories. The service directories can match the registered services with service requests, but do not learn any information about them. Our approach also enforces access control on services during the matching process, which prevents unauthorized users from discovering services. After the service directories match a set of services that satisfy the service users' functionality requirements, the service discovery approach presented in this dissertation further considers service users' QoS requirements in two steps. First, this approach optimizes services' QoS by making tradeoff among various QoS aspects with users' QoS requirements and preferences. Second, this approach ranks services based on how well they satisfy users' QoS requirements to help service users select the most suitable service to develop their SBSs.

Contributors

Agent

Created

Date Created
  • 2011

154589-Thumbnail Image.png

Predicting demographic and financial attributes in a bank marketing dataset

Description

Bank institutions employ several marketing strategies to maximize new customer acquisition as well as current customer retention. Telemarketing is one such approach taken where individual customers are contacted by

Bank institutions employ several marketing strategies to maximize new customer acquisition as well as current customer retention. Telemarketing is one such approach taken where individual customers are contacted by bank representatives with offers. These telemarketing strategies can be improved in combination with data mining techniques that allow predictability of customer information and interests. In this thesis, bank telemarketing data from a Portuguese banking institution were analyzed to determine predictability of several client demographic and financial attributes and find most contributing factors in each. Data were preprocessed to ensure quality, and then data mining models were generated for the attributes with logistic regression, support vector machine (SVM) and random forest using Orange as the data mining tool. Results were analyzed using precision, recall and F1 score.

Contributors

Agent

Created

Date Created
  • 2016

158510-Thumbnail Image.png

System Support for Large-scale Geospatial Data Analytics

Description

The volume of available spatial data has increased tremendously. Such data includes but is not limited to: weather maps, socioeconomic data, vegetation indices, geotagged social media, and more. These applications

The volume of available spatial data has increased tremendously. Such data includes but is not limited to: weather maps, socioeconomic data, vegetation indices, geotagged social media, and more. These applications need a powerful data management platform to support scalable and interactive analytics on big spatial data. Even though existing single-node spatial database systems (DBMSs) provide support for spatial data, they suffer from performance issues when dealing with big spatial data. Challenges to building large-scale spatial data systems are as follows: (1) System Scalability: The massive-scale of available spatial data hinders making sense of it using traditional spatial database management systems. Moreover, large-scale spatial data, besides its tremendous storage footprint, may be extremely difficult to manage and maintain due to the heterogeneous shapes, skewed data distribution and complex spatial relationship. (2) Fast analytics: When the user runs spatial data analytics applications using graphical analytics tools, she does not tolerate delays introduced by the underlying spatial database system. Instead, the user needs to see useful information quickly.

In this dissertation, I focus on designing efficient data systems and data indexing mechanisms to bolster scalable and interactive analytics on large-scale geospatial data. I first propose a cluster computing system GeoSpark which extends the core engine of Apache Spark and Spark SQL to support spatial data types, indexes, and geometrical operations at scale. In order to reduce the indexing overhead, I propose Hippo, a fast, yet scalable, sparse database indexing approach. In contrast to existing tree index structures, Hippo stores disk page ranges (each works as a pointer of one or many pages) instead of tuple pointers in the indexed table to reduce the storage space occupied by the index. Moreover, I present Tabula, a middleware framework that sits between a SQL data system and a spatial visualization dashboard to make the user experience with the dashboard more seamless and interactive. Tabula adopts a materialized sampling cube approach, which pre-materializes samples, not for the entire table as in the SampleFirst approach, but for the results of potentially unforeseen queries (represented by an OLAP cube cell).

Contributors

Agent

Created

Date Created
  • 2020

151407-Thumbnail Image.png

Context-aware rank-oriented recommender systems

Description

Recommender systems are a type of information filtering system that suggests items that may be of interest to a user. Most information retrieval systems have an overwhelmingly large number of

Recommender systems are a type of information filtering system that suggests items that may be of interest to a user. Most information retrieval systems have an overwhelmingly large number of entries. Most users would experience information overload if they were forced to explore the full set of results. The goal of recommender systems is to overcome this limitation by predicting how users will value certain items and returning the items that should be of the highest interest to the user. Most recommender systems collect explicit user feedback, such as a rating, and attempt to optimize their model to this rating value. However, there is potential for a system to collect implicit user feedback, such as user purchases and clicks, to learn user preferences. Additionally with implicit user feedback, it is possible for the system to remember the context of user feedback in terms of which other items a user was considering when making their decisions. When considering implicit user feedback, only a subset of all evaluation techniques can be used. Currently, sufficient evaluation techniques for evaluating implicit user feedback do not exist. In this thesis, I introduce a new model for recommendation that borrows the idea of opportunity cost from economics. There are two variations of the model, one considering context and one that does not. Additionally, I propose a new evaluation measure that works specifically for the case of implicit user feedback.

Contributors

Agent

Created

Date Created
  • 2012

153404-Thumbnail Image.png

The effect of image preprocessing techniques and varying JPEG quality on the identifiability of digital image splicing forgery

Description

Splicing of digital images is a powerful form of tampering which transports regions of an image to create a composite image. When used as an artistic tool, this practice is

Splicing of digital images is a powerful form of tampering which transports regions of an image to create a composite image. When used as an artistic tool, this practice is harmless but when these composite images can be used to create political associations or are submitted as evidence in the judicial system they become more impactful. In these cases, distinction between an authentic image and a tampered image can become important.

Many proposed approaches to image splicing detection follow the model of extracting features from an authentic and tampered dataset and then classifying them using machine learning with the goal of optimizing classification accuracy. This thesis approaches splicing detection from a slightly different perspective by choosing a modern splicing detection framework and examining a variety of preprocessing techniques along with their effect on classification accuracy. Preprocessing techniques explored include Joint Picture Experts Group (JPEG) file type block line blurring, image level blurring, and image level sharpening. Attention is also paid to preprocessing images adaptively based on the amount of higher frequency content they contain.

This thesis also recognizes an identified problem with using a popular tampering evaluation dataset where a mismatch in the number of JPEG processing iterations between the authentic and tampered set creates an unfair statistical bias, leading to higher detection rates. Many modern approaches do not acknowledge this issue but this thesis applies a quality factor equalization technique to reduce this bias. Additionally, this thesis artificially inserts a mismatch in JPEG processing iterations by varying amounts to determine its effect on detection rates.

Contributors

Agent

Created

Date Created
  • 2015

153583-Thumbnail Image.png

Understanding legacy workflows through runtime trace analysis

Description

When scientific software is written to specify processes, it takes the form of a workflow, and is often written in an ad-hoc manner in a dynamic programming language. There is

When scientific software is written to specify processes, it takes the form of a workflow, and is often written in an ad-hoc manner in a dynamic programming language. There is a proliferation of legacy workflows implemented by non-expert programmers due to the accessibility of dynamic languages. Unfortunately, ad-hoc workflows lack a structured description as provided by specialized management systems, making ad-hoc workflow maintenance and reuse difficult, and motivating the need for analysis methods. The analysis of ad-hoc workflows using compiler techniques does not address dynamic languages - a program has so few constrains that its behavior cannot be predicted. In contrast, workflow provenance tracking has had success using run-time techniques to record data. The aim of this work is to develop a new analysis method for extracting workflow structure at run-time, thus avoiding issues with dynamics.

The method captures the dataflow of an ad-hoc workflow through its execution and abstracts it with a process for simplifying repetition. An instrumentation system first processes the workflow to produce an instrumented version, capable of logging events, which is then executed on an input to produce a trace. The trace undergoes dataflow construction to produce a provenance graph. The dataflow is examined for equivalent regions, which are collected into a single unit. The workflow is thus characterized in terms of its treatment of an input. Unlike other methods, a run-time approach characterizes the workflow's actual behavior; including elements which static analysis cannot predict (for example, code dynamically evaluated based on input parameters). This also enables the characterization of dataflow through external tools.

The contributions of this work are: a run-time method for recording a provenance graph from an ad-hoc Python workflow, and a method to analyze the structure of a workflow from provenance. Methods are implemented in Python and are demonstrated on real world Python workflows. These contributions enable users to derive graph structure from workflows. Empowered by a graphical view, users can better understand a legacy workflow. This makes the wealth of legacy ad-hoc workflows accessible, enabling workflow reuse instead of investing time and resources into creating a workflow.

Contributors

Agent

Created

Date Created
  • 2015