Matching Items (69)
- Genre: Masters Thesis
- Creators: Davulcu, Hasan
- Member of: Theses and Dissertations
- Resource Type: Text
Text search is a very useful way of retrieving document information from a particular website. The public generally use internet search engines over the local enterprise search engines, because the enterprise content is not cross linked and does not follow a page rank algorithm. On the other hand the enterprise search engine uses metadata information, which allows the user to specify the conditions that any retrieved document should meet. Therefore, using metadata information for searching will also be very useful. My thesis aims on developing an enterprise search engine using metadata information by providing advanced features like faceted navigation. The search engine data was extracted from various Indonesian web sources. Metadata information like person, organization, location, and sentiment analytic keyword entities should be tagged in each document to provide facet search capability. A shallow parsing technique like named entity recognizer is used for this purpose. There are more than 1500 entities that have been tagged in this process. These documents have been successfully converted into XML format and are indexed with "Apache Solr". It is an open source enterprise search engine with full text search and faceted search capabilities. The entities will be helpful for users to specify conditions and search faster through the large collection of documents. The user is assured results by clicking on a metadata condition. Since the sentiment analytic keywords are tagged with positive and negative values, social scientists can use these results to check for overlapping or conflicting organizations and ideologies. In addition, this tool is the first of its kind for the Indonesian language. The results are fetched much faster and with better accuracy.
With the advent of Internet, the data being added online is increasing at enormous rate. Though search engines are using IR techniques to facilitate the search requests from users, the results are not effective towards the search query of the user. The search engine user has to go through certain webpages before getting at the webpage he/she wanted. This problem of Information Overload can be solved using Automatic Text Summarization. Summarization is a process of obtaining at abridged version of documents so that user can have a quick view to understand what exactly the document is about. Email threads from W3C are used in this system. Apart from common IR features like Term Frequency, Inverse Document Frequency, Term Rank, a variation of page rank based on graph model, which can cluster the words with respective to word ambiguity, is implemented. Term Rank also considers the possibility of co-occurrence of words with the corpus and evaluates the rank of the word accordingly. Sentences of email threads are ranked as per features and summaries are generated. System implemented the concept of pyramid evaluation in content selection. The system can be considered as a framework for Unsupervised Learning in text summarization.
Micro-blogging platforms like Twitter have become some of the most popular sites for people to share and express their views and opinions about public events like debates, sports events or other news articles. These social updates by people complement the written news articles or transcripts of events in giving the popular public opinion about these events. So it would be useful to annotate the transcript with tweets. The technical challenge is to align the tweets with the correct segment of the transcript. ET-LDA by Hu et al  addresses this issue by modeling the whole process with an LDA-based graphical model. The system segments the transcript into coherent and meaningful parts and also determines if a tweet is a general tweet about the event or it refers to a particular segment of the transcript. One characteristic of the Hu et al’s model is that it expects all the data to be available upfront and uses batch inference procedure. But in many cases we find that data is not available beforehand, and it is often streaming. In such cases it is infeasible to repeatedly run the batch inference algorithm. My thesis presents an online inference algorithm for the ET-LDA model, with a continuous stream of tweet data and compare their runtime and performance to existing algorithms.
Browsing Twitter users, or browsers, often find it increasingly cumbersome to attach meaning to tweets that are displayed on their timeline as they follow more and more users or pages. The tweets being browsed are created by Twitter users called originators, and are of some significance to the browser who has chosen to subscribe to the tweets from the originator by following the originator. Although, hashtags are used to tag tweets in an effort to attach context to the tweets, many tweets do not have a hashtag. Such tweets are called orphan tweets and they adversely affect the experience of a browser.
A hashtag is a type of label or meta-data tag used in social networks and micro-blogging services which makes it easier for users to find messages with a specific theme or content. The context of a tweet can be defined as a set of one or more hashtags. Users often do not use hashtags to tag their tweets. This leads to the problem of missing context for tweets. To address the problem of missing hashtags, a statistical method was proposed which predicts most likely hashtags based on the social circle of an originator.
In this thesis, we propose to improve on the existing context recovery system by selectively limiting the candidate set of hashtags to be derived from the intimate circle of the originator rather than from every user in the social network of the originator. This helps in reducing the computation, increasing speed of prediction, scaling the system to originators with large social networks while still preserving most of the accuracy of the predictions. We also propose to not only derive the candidate hashtags from the social network of the originator but also derive the candidate hashtags based on the content of the tweet. We further propose to learn personalized statistical models according to the adoption patterns of different originators. This helps in not only identifying the personalized candidate set of hashtags based on the social circle and content of the tweets but also in customizing the hashtag adoption pattern to the originator of the tweet.
Techniques for supporting prediction of security breaches in critical cloud infrastructures using Bayesian network and Markov decision process
Emerging trends in cyber system security breaches in critical cloud infrastructures show that attackers have abundant resources (human and computing power), expertise and support of large organizations and possible foreign governments. In order to greatly improve the protection of critical cloud infrastructures, incorporation of human behavior is needed to predict potential security breaches in critical cloud infrastructures. To achieve such prediction, it is envisioned to develop a probabilistic modeling approach with the capability of accurately capturing system-wide causal relationship among the observed operational behaviors in the critical cloud infrastructure and accurately capturing probabilistic human (users’) behaviors on subsystems as the subsystems are directly interacting with humans. In our conceptual approach, the system-wide causal relationship can be captured by the Bayesian network, and the probabilistic human behavior in the subsystems can be captured by the Markov Decision Processes. The interactions between the dynamically changing state graphs of Markov Decision Processes and the dynamic causal relationships in Bayesian network are key components in such probabilistic modelling applications. In this thesis, two techniques are presented for supporting the above vision to prediction of potential security breaches in critical cloud infrastructures. The first technique is for evaluation of the conformance of the Bayesian network with the multiple MDPs. The second technique is to evaluate the dynamically changing Bayesian network structure for conformance with the rules of the Bayesian network using a graph checker algorithm. A case study and its simulation are presented to show how the two techniques support the specific parts in our conceptual approach to predicting system-wide security breaches in critical cloud infrastructures.
Analysis of political texts, which contains a huge amount of personal political opinions, sentiments, and emotions towards powerful individuals, leaders, organizations, and a large number of people, is an interesting task, which can lead to discover interesting interactions between the political parties and people. Recently, political blogosphere plays an increasingly important role in politics, as a forum for debating political issues. Most of the political weblogs are biased towards their political parties, and they generally express their sentiments towards their issues (i.e. leaders, topics etc.,) and also towards issues of the opposing parties. In this thesis, I have modeled the above interactions/debate as a sentimental bi-partite graph, a bi-partite graph with Blogs forming vertices of a disjoint set, and the issues (i.e. leaders, topics etc.,) forming the other disjoint set,and the edges between the two sets representing the sentiment of the blogs towards the issues. I have used American Political blog data to model the sentimental bi- partite graph, in particular, a set of popular political liberal and conservative blogs that have clearly declared positions. These blogs contain discussion about social, political, economic issues and related key individuals in their conservative/liberal view. To be more focused and more polarized, 22 most popular liberal/conservative blogs of a particular time period, May 2008 - October 2008(because of high intensity of debate and discussions), just before the presidential elections, was considered, involving around 23,800 articles. This thesis involves solving the questions: a) which is the most liberal/conservative blogs on the web? b) Who is on which side of debate and what are the issues? c) Who are the important leaders? d) How do you model the relationship between the participants of the debate and the underlying issues?
Continuous Delivery, as one of the youngest and most popular member of agile model family, has become a popular concept and method in software development industry recently. Instead of the traditional software development method, which requirements and solutions must be fixed before starting software developing, it promotes adaptive planning, evolutionary development and delivery, and encourages rapid and flexible response to change. However, several problems prevent Continuous Delivery to be introduced into education world. Taking into the consideration of the barriers, we propose a new Cloud based Continuous Delivery Software Developing System. This system is designed to fully utilize the whole life circle of software developing according to Continuous Delivery concepts in a virtualized environment in Vlab platform.
Muslim radicalism is recognized as one of the greatest security threats for the United States and the rest of the world. Use of force to eliminate specific radical entities is ineffective in containing radicalism as a whole. There is a need to understand the origin, ideologies and behavior of Radical and Counter-Radical organizations and how they shape up over a period of time. Recognizing and supporting counter-radical organizations is one of the most important steps towards impeding radical organizations. A lot of research has already been done to categorize and recognize organizations, to understand their behavior, their interactions with other organizations, their target demographics and the area of influence. We have a huge amount of information which is a result of the research done over these topics. This thesis provides a powerful and interactive way to navigate through all this information, using a Visualization Dashboard. The dashboard makes it easier for Social Scientists, Policy Analysts, Military and other personnel to visualize an organization's propensity towards violence and radicalism. It also tracks the peaking religious, political and socio-economic markers, their target demographics and locations. A powerful search interface with parametric search helps in narrowing down to specific scenarios and view the corresponding information related to the organizations. This tool helps to identify moderate Counter-Radical organizations and also has the potential of predicting the orientation of various organizations based on the current information.
In visualizing information hierarchies, icicle plots are efficient diagrams in that they provide the user a straightforward layout for different levels of data in a hierarchy and enable the user to compare items based on the item width. However, as the size of the hierarchy grows large, the items in an icicle plot end up being small and indistinguishable. In this thesis, by maintaining the positive characteristics of traditional
icicle plots and incorporating new features such as dynamic diagram and active layer, we developed an interactive visualization that allows the user to selectively drill down or roll up to review different levels of data in a large hierarchy, to change the hierarchical
structure to detect potential patterns, and to maintain an overall understanding of the
current hierarchical structure.
Analyzing the effects of Bollinger bands on the probability of stock options using support vector machines
The purpose of this research is to efficiently analyze certain data provided and to see if a useful trend can be observed as a result. This trend can be used to analyze certain probabilities. There are three main pieces of data which are being analyzed in this research: The value for δ of the call and put option, the %B value of the stock, and the amount of time until expiration of the stock option. The %B value is the most important. The purpose of analyzing the data is to see the relationship between the variables and, given certain values, what is the probability the trade makes money. This result will be used in finding the probability certain trades make money over a period of time.
Since options are so dependent on probability, this research specifically analyzes stock options rather than stocks themselves. Stock options have value like stocks except options are leveraged. The most common model used to calculate the value of an option is the Black-Scholes Model . There are five main variables the Black-Scholes Model uses to calculate the overall value of an option. These variables are θ, δ, γ, v, and ρ. The variable, θ is the rate of change in price of the option due to time decay, δ is the rate of change of the option’s price due to the stock’s changing value, γ is the rate of change of δ, v represents the rate of change of the value of the option in relation to the stock’s volatility, and ρ represents the rate of change in value of the option in relation to the interest rate . In this research, the %B value of the stock is analyzed along with the time until expiration of the option. All options have the same δ. This is due to the fact that all the options analyzed in this experiment are less than two months from expiration and the value of δ reveals how far in or out of the money an option is.
The machine learning technique used to analyze the data and the probability
is support vector machines. Support vector machines analyze data that can be classified in one of two or more groups and attempts to find a pattern in the data to develop a model, which reliably classifies similar, future data into the correct group. This is used to analyze the outcome of stock options.